Abstract

Penalization procedures often suffer from their dependence on multiplying factors, whose 
optimal values are either unknown or hard to estimate from data. We propose a completely 
data-driven calibration algorithm for these parameters in the least-squares regression frame- 
work, without assuming a particular shape for the penalty. Our algorithm relies on the 
concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context
of penalized least squares for Gaussian homoscedastic regression. On the positive
side, the minimal penalty can be evaluated from the data themselves, leading to a data- 
driven estimation of an optimal penalty which can be used in practice; on the negative 
side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic 
framework. 

The purpose of this paper is twofold: stating a more general heuristics for designing 
a data-driven penalty (the slope heuristics) and proving that it works for penalized least- 
squares regression with a random design, even for heteroscedastic non-Gaussian data. For 
technical reasons, some exact mathematical results will be proved only for regressogram 
bin-width selection. This is at least a first step towards further results, since the approach 
and the method that we use are indeed general. 

Keywords: Data-driven Calibration, Non-parametric Regression, Model Selection by 
Penalization, Heteroscedastic Data, Regressogram 



1. Introduction 



In the last decades, model selection has received much interest, commonly through
penalization. In short, penalization chooses the model minimizing the sum of the empirical
risk (how well the algorithm fits the data) and of some measure of complexity of the model
(called penalty); see FPE (Akaike, 1970), AIC (Akaike, 1973), Mallows' $C_p$ or $C_L$
(Mallows, 1973). Many other penalization procedures have been proposed since, among which
Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002), local Rademacher
complexities (Bartlett et al., 2005; Koltchinskii, 2006), bootstrap penalties (Efron, 1983),
resampling and $V$-fold penalties (Arlot, 2008b,c).



Model selection can target two different goals. On the one hand, a procedure is efficient 
(or asymptotically optimal) when its quadratic risk is asymptotically equivalent to the risk 
of the oracle. On the other hand, a procedure is consistent when it chooses the smallest
true model asymptotically with probability one. This paper deals with efficient procedures, 
without assuming the existence of a true model. 

A huge amount of literature exists about efficiency. First, Mallows' $C_p$, Akaike's FPE
and AIC are asymptotically optimal, as proved by Shibata (1981) for Gaussian errors, by
Li (1987) under suitable moment assumptions on the errors, and by Polyak and Tsybakov
(1990) under sharper moment conditions, in the Fourier case. Non-asymptotic oracle
inequalities (with some leading constant $C > 1$) have been obtained by Barron et al. (1999)
and by Birgé and Massart (2001) in the Gaussian case, and by Baraud (2000, 2002) under
some moment assumptions on the errors. In the Gaussian case, non-asymptotic oracle
inequalities with leading constant $C_n$ tending to 1 when $n$ tends to infinity have been
obtained by Birgé and Massart (2007).



However, from the practical point of view, both AIC and Mallows' $C_p$ still present
serious drawbacks. On the one hand, AIC relies on a strong asymptotic assumption, so
that for small sample sizes, the optimal multiplying factor can be quite different from one.
Therefore, corrected versions of AIC have been proposed (Sugiura, 1978; Hurvich and Tsai,
1989). On the other hand, the optimal calibration of Mallows' $C_p$ requires the knowledge
of the noise level $\sigma^2$, assumed to be constant. When real data are involved, $\sigma^2$ has to be
estimated separately and independently from any model, which is a difficult task. Moreover,
the best estimator of $\sigma^2$ (say, with respect to the quadratic error) is quite unlikely to lead to
the most efficient model selection procedure. Contrary to Mallows' $C_p$, the data-dependent
calibration rule defined in this article is not a "plug-in" method; it focuses directly on
efficiency, which can significantly improve the performance of the model selection procedure.

Existing penalization procedures present similar or stronger drawbacks than AIC and
Mallows' $C_p$, often because of a gap between theory and practice. For instance, oracle
inequalities have only been proved for (global) Rademacher penalties multiplied by a factor
two (Koltchinskii, 2001), while they are used without this factor (Lozano, 2000). As proved
by Arlot (2007, Chapter 9), this factor is necessary in general. Therefore, the optimal
calibration of these penalties is really an issue. The calibration problem is even harder for local
Rademacher complexities: theoretical results hold only with large calibration constants, 
particularly the multiplying factor, and no optimal values are known. One of the purposes 
of this paper is to address the issue of optimizing the multiplying factor for general-shape 
penalties. 

Few automatic calibration algorithms are available. The most popular ones are certainly
cross-validation methods (Allen, 1974; Stone, 1974), in particular $V$-fold cross-validation
(Geisser, 1975), because these are general-purpose methods, relying on a widely valid
heuristics. However, their computational cost can be high. For instance, $V$-fold cross-validation
requires the entire model selection procedure to be performed $V$ times for each candidate
value of the constant to be calibrated. For penalties proportional to the dimension of the
models, such as Mallows' $C_p$, alternative calibration procedures have been proposed by
George and Foster (2000) and by Shen and Ye (2002).

A completely different approach has been proposed by Birgé and Massart (2007) for
calibrating dimensionality-based penalties. Since this article extends their approach to a
much wider range of applications, let us briefly recall their main results. In Gaussian
homoscedastic regression with a fixed design, assume that each model is a finite-dimensional
vector space. Consider the penalty $\mathrm{pen}(m) = K D_m$, where $D_m$ is the dimension of the
model $m$ and $K > 0$ is a positive constant, to be calibrated. First, there exists a minimal
constant $K_{\min}$, such that the ratio between the quadratic risk of the chosen estimator
and the quadratic risk of the oracle is asymptotically infinite if $K < K_{\min}$, and finite if
$K > K_{\min}$. Second, when $K = K^\star := 2 K_{\min}$, the penalty $K D_m$ yields an efficient model
selection procedure. In other words, the optimal penalty is twice the minimal penalty. This
relationship characterizes the "slope heuristics" of Birgé and Massart (2007).

A crucial fact is that the minimal constant $K_{\min}$ can be estimated from the data, since
large models are selected if and only if $K < K_{\min}$. This leads to the following strategy for
choosing $K$ from the data. For every $K > 0$, let $\hat m(K)$ be the model selected by minimizing
the empirical risk penalized by $\mathrm{pen}(D_m) = K D_m$. First, compute $\hat K_{\min}$ such that $D_{\hat m(K)}$ is
"huge" for $K < \hat K_{\min}$ and "reasonably small" when $K > \hat K_{\min}$; explicit values for "huge"
and "small" are proposed in Section 3.3. Second, define $\hat m := \hat m(2 \hat K_{\min})$. Such a method
has been successfully applied for multiple change points detection by Lebarbier (2005).



From the theoretical point of view, the issue for understanding and validating this 
approach is the existence of a minimal penalty. This question has been addressed for 
Gaussian homoscedastic regression with a fixed design by Birgé and Massart (2001, 2007)
when the variance is known, and by Baraud et al. (2007) when the variance is unknown.
Non-Gaussian or heteroscedastic data have never been considered. This article helps to
fill this gap in the theoretical understanding of penalization procedures.

The calibration algorithm proposed in this article relies on a generalization of Birgé
and Massart's slope heuristics (Section 2.3). In Section 3, the algorithm is defined in the
least-squares regression framework, for general-shape penalties. The shape of the penalty
itself can be estimated from the data, as explained in Section 3.4.

The theoretical validation of the algorithm is provided in Section 4, from the
non-asymptotic point of view. Non-asymptotic means in particular that the collection of models
is allowed to depend on n: in practice, it is usual to allow the number of explanatory 
variables to increase with the number of observations. Considering models with a large 
number of parameters (for example of the order of a power of the sample size n) is also 
necessary to approximate functions belonging to a general approximation space. Thus, 
the non-asymptotic point of view allows us not to assume that the regression function is 
described with a small number of parameters. 

The existence of minimal penalties for heteroscedastic regression with a random design
(Theorem 2) is proved in Section 4.3. In Section 4.4, by proving that twice the minimal
penalty has some optimality properties (Theorem 3), we extend the so-called slope
heuristics to heteroscedastic regression with a random design. Moreover, neither Theorem 2 nor
Theorem 3 assumes the data to be Gaussian; only mild moment assumptions are required.

To prove Theorems 2 and 3, each model is assumed to be the vector space of piecewise
constant functions on some partition of the feature space. This is indeed a restriction, but 
we conjecture that it is mainly technical, and that the slope heuristics remains valid at 
least in the general least-squares regression framework. We provide some evidence for this 
by proving two key concentration inequalities without the restriction to piecewise constant 
functions. Another argument supporting this conjecture is that recently several simulation 
studies have shown that the slope heuristics can be used in several frameworks: mixture 
models (Maugis and Michel, 2008), clustering (Baudry, 2007), spatial statistics (Verzelen,
2008), estimation of oil reserves (Lepez, 2002) and genomics (Villers, 2007). Although the
slope heuristics has not been formally validated in these frameworks, this article is a first 
step towards such a validation, by proving that the slope heuristics can be applied whatever 
the shape of the ideal penalty. 

This paper is organized as follows. The framework and the slope heuristics are described
in Section 2. The resulting algorithm is defined in Section 3. The main theoretical results
are stated in Section 4. All the proofs are given in Appendix A.



2. Framework 

In this section, we describe the framework and the general slope heuristics. 
2.1 Least-squares regression 

Suppose we observe some data $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathbb{R}$, independent with common
distribution $P$, where the feature space $\mathcal{X}$ is typically a compact set of $\mathbb{R}^d$. The goal is to
predict $Y$ given $X$, where $(X, Y) \sim P$ is a new data point independent of $(X_i, Y_i)_{1 \leq i \leq n}$.
Denoting by $s$ the regression function, that is $s(x) = E[Y \mid X = x]$ for every $x \in \mathcal{X}$, we
can write

$Y_i = s(X_i) + \sigma(X_i) \epsilon_i$   (1)

where $\sigma : \mathcal{X} \to \mathbb{R}$ is the heteroscedastic noise level and the $\epsilon_i$ are i.i.d. centered noise terms,
possibly dependent on $X_i$, but with mean 0 and variance 1 conditionally to $X_i$.
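To make model (1) concrete, the following minimal sketch draws a sample from it. The choices below are illustrative assumptions only, not part of the framework: $X$ is taken uniform on $[0, 1]$, the noise is standard Gaussian (the model itself allows non-Gaussian noise), and the names `simulate`, `s` and `sigma` are hypothetical.

```python
import random


def simulate(n, s, sigma, seed=0):
    """Draw (X_i, Y_i), i = 1..n, with Y_i = s(X_i) + sigma(X_i) * eps_i.

    Assumptions of this sketch: X_i uniform on [0, 1] and eps_i standard
    Gaussian, so that eps_i has mean 0 and variance 1 given X_i.
    """
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [s(x) + sigma(x) * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys
```

For instance, `simulate(200, math.sin, lambda x: 1.0 + x)` (after `import math`) would give a heteroscedastic sample whose noise level grows with `x`.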

The quality of a predictor $t : \mathcal{X} \to \mathcal{Y}$ is measured by the (quadratic) prediction loss

$E_{(X,Y) \sim P}[\gamma(t, (X, Y))] =: P\gamma(t)$   where   $\gamma(t, (x, y)) = (t(x) - y)^2$

is the least-squares contrast. The minimizer of $P\gamma(t)$ over the set of all predictors, called
Bayes predictor, is the regression function $s$. Therefore, the excess loss is defined as

$\ell(s, t) := P\gamma(t) - P\gamma(s) = E_{(X,Y) \sim P}(t(X) - s(X))^2$.

Given a particular set of predictors $S_m$ (called a model), we define the best predictor over
$S_m$ as

$s_m := \arg\min_{t \in S_m} \{ P\gamma(t) \}$,

with its empirical counterpart

$\hat s_m := \arg\min_{t \in S_m} \{ P_n\gamma(t) \}$

(when it exists and is unique), where $P_n = n^{-1} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$. This estimator is the
well-known empirical risk minimizer, also called least-squares estimator since $\gamma$ is the
least-squares contrast.






Data-driven Calibration of Penalties 



2.2 Ideal model selection 

Let us assume that we are given a family of models $(S_m)_{m \in \mathcal{M}_n}$, hence a family of estimators
$(\hat s_m)_{m \in \mathcal{M}_n}$ obtained by empirical risk minimization. The model selection problem consists
in looking for some data-dependent $\hat m \in \mathcal{M}_n$ such that $\ell(s, \hat s_{\hat m})$ is as small as possible. For
instance, it would be convenient to prove some oracle inequality of the form

$\ell(s, \hat s_{\hat m}) \leq C \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} + R_n$

in expectation or on an event of large probability, with leading constant $C$ close to 1 and
$R_n = o(n^{-1})$.

General penalization procedures can be described as follows. Let $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}^+$ be
some penalty function, possibly data-dependent, and define

$\hat m \in \arg\min_{m \in \mathcal{M}_n} \{ \mathrm{crit}(m) \}$   with   $\mathrm{crit}(m) := P_n\gamma(\hat s_m) + \mathrm{pen}(m)$.   (2)

Since the ideal criterion $\mathrm{crit}(m)$ is the true prediction error $P\gamma(\hat s_m)$, the ideal penalty is

$\mathrm{pen}_{\mathrm{id}}(m) := P\gamma(\hat s_m) - P_n\gamma(\hat s_m)$.

This quantity is unknown because it depends on the true distribution $P$. A natural idea is
to choose $\mathrm{pen}(m)$ as close as possible to $\mathrm{pen}_{\mathrm{id}}(m)$ for every $m \in \mathcal{M}_n$. We will show below,
in a general setting, that when $\mathrm{pen}$ is a good estimator of the ideal penalty $\mathrm{pen}_{\mathrm{id}}$, then $\hat m$
satisfies an oracle inequality with leading constant $C$ close to 1.

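By definition of $\hat m$,

$\forall m \in \mathcal{M}_n$,   $P_n\gamma(\hat s_{\hat m}) \leq P_n\gamma(\hat s_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m)$.

For every $m \in \mathcal{M}_n$, we define

$p_1(m) = P(\gamma(\hat s_m) - \gamma(s_m))$,   $p_2(m) = P_n(\gamma(s_m) - \gamma(\hat s_m))$,   $\delta(m) = (P_n - P)(\gamma(s_m))$

so that

$\mathrm{pen}_{\mathrm{id}}(m) = p_1(m) + p_2(m) - \delta(m)$
and   $\ell(s, \hat s_m) = P_n\gamma(\hat s_m) + p_1(m) + p_2(m) - \delta(m) - P\gamma(s)$.

Hence, for every $m \in \mathcal{M}_n$,

$\ell(s, \hat s_{\hat m}) + (\mathrm{pen} - \mathrm{pen}_{\mathrm{id}})(\hat m) \leq \ell(s, \hat s_m) + (\mathrm{pen} - \mathrm{pen}_{\mathrm{id}})(m)$.   (3)

Therefore, in order to derive an oracle inequality from (3), it is sufficient to show that for
every $m \in \mathcal{M}_n$, $\mathrm{pen}(m)$ is close to $\mathrm{pen}_{\mathrm{id}}(m)$.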
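For completeness, the step from the definition of $\hat m$ to inequality (3) can be spelled out as follows. This is a short derivation using only the definition of the penalized criterion (2) and the decomposition of the excess loss in terms of $\mathrm{pen}_{\mathrm{id}}$; the notation is the one of this section.

```latex
% By definition of \hat m as a minimizer of crit, for every m in M_n:
P_n\gamma(\hat s_{\hat m}) + \mathrm{pen}(\hat m)
  \le P_n\gamma(\hat s_m) + \mathrm{pen}(m).
% Since pen_id(m) = P\gamma(\hat s_m) - P_n\gamma(\hat s_m), for every m
% (including m = \hat m):
P_n\gamma(\hat s_m) = \ell(s,\hat s_m) - \mathrm{pen}_{\mathrm{id}}(m) + P\gamma(s).
% Substituting on both sides and cancelling P\gamma(s) gives (3):
\ell(s,\hat s_{\hat m}) + (\mathrm{pen}-\mathrm{pen}_{\mathrm{id}})(\hat m)
  \le \ell(s,\hat s_m) + (\mathrm{pen}-\mathrm{pen}_{\mathrm{id}})(m).
```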
2.3 The slope heuristics

If the penalty is too big, the left-hand side of (3) is larger than $\ell(s, \hat s_{\hat m})$ so that (3) implies
an oracle inequality, possibly with large leading constant $C$. On the contrary, if the penalty
is too small, the left-hand side of (3) may become negligible with respect to $\ell(s, \hat s_{\hat m})$
(which would make $C$ explode) or, worse, may be nonpositive. In the latter case, no
oracle inequality may be derived from (3). We shall see in the following that $\ell(s, \hat s_{\hat m})$ blows
up if and only if the penalty is smaller than some "minimal penalty".

Let us consider first the case $\mathrm{pen}(m) = p_2(m)$ in (2). Then, $E[\mathrm{crit}(m)] = E[P_n\gamma(s_m)] =
P\gamma(s_m)$, so that $\hat m$ approximately minimizes its bias. Therefore, $\hat m$ is one of the more
complex models, and the risk of $\hat s_{\hat m}$ is large. Let us assume now that $\mathrm{pen}(m) = K p_2(m)$. If
$0 < K < 1$, $\mathrm{crit}(m)$ is a decreasing function of the complexity of $m$, so that $\hat m$ is again one of
the more complex models. On the contrary, if $K > 1$, $\mathrm{crit}(m)$ increases with the complexity
of $m$ (at least for the largest models), so that $\hat m$ has a small or medium complexity. This
argument supports the conjecture that the "minimal amount of penalty" required for the
model selection procedure to work is $p_2(m)$.
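The role of the threshold $K = 1$ in the argument above can be made explicit with a one-line computation. It only uses the definition $p_2(m) = P_n(\gamma(s_m) - \gamma(\hat s_m))$ and the fact that $s_m$ is deterministic; the monotonicity statements in the comments are the heuristic assumptions of the text, not proved facts.

```latex
% With pen(m) = K p_2(m), and P_n\gamma(\hat s_m) = P_n\gamma(s_m) - p_2(m):
\mathrm{crit}(m) = P_n\gamma(\hat s_m) + K p_2(m)
                 = P_n\gamma(s_m) - (1 - K)\, p_2(m).
% Taking expectations (s_m is deterministic, so E[P_n\gamma(s_m)] = P\gamma(s_m)):
E[\mathrm{crit}(m)] = P\gamma(s_m) - (1 - K)\, E[p_2(m)].
% Heuristically, P\gamma(s_m) decreases and E[p_2(m)] increases with the
% complexity of m. For K < 1 both terms push towards the largest models;
% for K > 1 the term (K - 1) E[p_2(m)] penalizes them.
```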

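In many frameworks such as the one of Section 4.1, it turns out that

$\forall m \in \mathcal{M}_n$,   $p_1(m) \approx p_2(m)$.

Hence, the ideal penalty $\mathrm{pen}_{\mathrm{id}}(m) \approx p_1(m) + p_2(m)$ is close to $2 p_2(m)$. Since $p_2(m)$ is a
"minimal penalty", the optimal penalty is close to twice the minimal penalty:

$\mathrm{pen}_{\mathrm{id}}(m) \approx 2\, \mathrm{pen}_{\min}(m)$.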

This is the so-called "slope heuristics", first introduced by Birgé and Massart (2007) in
a Gaussian homoscedastic setting. Note that a formal proof of the validity of the slope
heuristics has only been given for Gaussian homoscedastic least-squares regression with a
fixed design (Birgé and Massart, 2007): to the best of our knowledge, the present paper
yields the second theoretical result on the slope heuristics.

This heuristics has some applications because the minimal penalty can be estimated 
from the data. Indeed, when the penalty is smaller than $\mathrm{pen}_{\min}$, the selected model $\hat m$ is
among the more complex. On the contrary, when the penalty is larger than $\mathrm{pen}_{\min}$, the
complexity of $\hat m$ is much smaller. This leads to the algorithm described in the next section.

3. A data-driven calibration algorithm 

Now, a data-driven calibration algorithm for penalization procedures can be defined,
generalizing a method proposed by Birgé and Massart (2007) and implemented by Lebarbier
(2005).



3.1 The general algorithm 

Assume that the shape $\mathrm{pen}_{\mathrm{shape}} : \mathcal{M}_n \to \mathbb{R}^+$ of the ideal penalty is known, from some
prior knowledge or because it had first been estimated, see Section 3.4. Then, the penalty
$K^\star \mathrm{pen}_{\mathrm{shape}}$ provides an approximately optimal procedure, for some unknown constant $K^\star > 0$.
The goal is to find some $\hat K$ such that $\hat K \mathrm{pen}_{\mathrm{shape}}$ is approximately optimal.
Let $D_m$ be some known complexity measure of the model $m \in \mathcal{M}_n$. Typically, when
the models are finite-dimensional vector spaces, $D_m$ is the dimension of $S_m$. According to
the "slope heuristics" detailed in Section 2.3, the following algorithm provides an optimal
calibration of the penalty $\mathrm{pen}_{\mathrm{shape}}$.

Algorithm 1 (Data-driven penalization with slope heuristics)

1. Compute the selected model $\hat m(K)$ as a function of $K > 0$:

$\hat m(K) \in \arg\min_{m \in \mathcal{M}_n} \{ P_n\gamma(\hat s_m) + K \mathrm{pen}_{\mathrm{shape}}(m) \}$.

2. Find $\hat K_{\min} > 0$ such that $D_{\hat m(K)}$ is "huge" for $K < \hat K_{\min}$ and "reasonably small" for
$K > \hat K_{\min}$.

3. Select the model $\hat m := \hat m(2 \hat K_{\min})$.

A computationally efficient way to perform the first step of Algorithm 1 is provided in
Section 3.2. The accurate definition of $\hat K_{\min}$ is discussed in Section 3.3 (including explicit
values for "huge" and "reasonably small"). Then, once $P_n\gamma(\hat s_m)$ and $\mathrm{pen}_{\mathrm{shape}}(m)$ are known
for every $m \in \mathcal{M}_n$, the complexity of Algorithm 1 is $O(\mathrm{Card}(\mathcal{M}_n)^2)$ (see Algorithm 2 and
Proposition 1). This can be a decisive advantage compared to cross-validation methods, as
discussed in Section 4.6.
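The three steps of Algorithm 1 can be sketched in a few lines. The sketch below is a brute-force variant under stated assumptions, not the efficient procedure of Section 3.2: Step 1 is evaluated only on a user-supplied grid of values of $K$, Step 2 uses a complexity threshold as one possible reading of "reasonably small", and models are represented by indices sorted so that the penalty shape is non-decreasing. The names `select`, `slope_heuristics`, `k_grid` and `d_thresh` are hypothetical.

```python
def select(f, g, k):
    """Step 1 for one value of K: argmin of f(m) + K g(m), smallest index on ties."""
    return min(range(len(f)), key=lambda m: (f[m] + k * g[m], m))


def slope_heuristics(f, g, dims, d_thresh, k_grid):
    """Brute-force sketch of Algorithm 1 on a grid of K values.

    f[m]: empirical risk of model m, g[m]: penalty shape, dims[m]: complexity D_m.
    Returns (k_min, m_hat), where k_min is the smallest grid value whose selected
    model has complexity at most d_thresh, and m_hat is selected with 2 * k_min.
    """
    for k in sorted(k_grid):
        if dims[select(f, g, k)] <= d_thresh:
            # Step 3: double the estimated minimal constant.
            return k, select(f, g, 2 * k)
    raise ValueError("no value of K in the grid selects a small enough model")
```

The accuracy of `k_min` is limited by the grid resolution, which is one reason to prefer the exact computation of the whole path.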



3.2 Computation of $(\hat m(K))_{K \geq 0}$

Step 1 of Algorithm 1 requires computing $\hat m(K)$ for every $K \in (0, +\infty)$. A computationally
efficient way to perform this step is described in this subsection.
We start with some notation:

$\forall m \in \mathcal{M}_n$,   $f(m) = P_n\gamma(\hat s_m)$,   $g(m) = \mathrm{pen}_{\mathrm{shape}}(m)$

and   $\forall K \geq 0$,   $\hat m(K) \in \arg\min_{m \in \mathcal{M}_n} \{ f(m) + K g(m) \}$.

Since the latter definition can be ambiguous, let us choose any total ordering $\preceq$ on $\mathcal{M}_n$
such that $g$ is non-decreasing, which is always possible if $\mathcal{M}_n$ is at most countable. Then,
$\hat m(K)$ is defined as the smallest element of

$E(K) := \arg\min_{m \in \mathcal{M}_n} \{ f(m) + K g(m) \}$

for $\preceq$. The main reason why the whole trajectory $(\hat m(K))_{K \geq 0}$ can be computed efficiently
is its particular shape.

Indeed, the proof of Proposition 1 shows that $K \mapsto \hat m(K)$ is piecewise constant, and
non-increasing for $\preceq$. Then, the whole trajectory $(\hat m(K))_{K \geq 0}$ can be summarized by

• the number of jumps $i_{\max} \in \{ 0, \ldots, \mathrm{Card}(\mathcal{M}_n) - 1 \}$,

• the location of the jumps: an increasing sequence of nonnegative reals $(K_i)_{0 \leq i \leq i_{\max}+1}$,
with $K_0 = 0$ and $K_{i_{\max}+1} = +\infty$,

• a non-increasing sequence of models $(m_i)_{0 \leq i \leq i_{\max}}$,

with $\forall i \in \{ 0, \ldots, i_{\max} \}$, $\forall K \in [K_i, K_{i+1})$, $\hat m(K) = m_i$.

Algorithm 2 (Step 1 of Algorithm 1) For every $m \in \mathcal{M}_n$, define $f(m) = P_n\gamma(\hat s_m)$
and $g(m) = \mathrm{pen}_{\mathrm{shape}}(m)$. Choose $\preceq$ any total ordering on $\mathcal{M}_n$ such that $g$ is non-decreasing.

• Init: $K_0 := 0$, $m_0 := \arg\min_{m \in \mathcal{M}_n} \{ f(m) \}$ (when this minimum is attained several
times, $m_0$ is defined as the smallest one with respect to $\preceq$).

• Step $i$, $i \geq 1$: Let

$G(m_{i-1}) := \{ m \in \mathcal{M}_n$ s.t. $f(m) > f(m_{i-1})$ and $g(m) < g(m_{i-1}) \}$.

If $G(m_{i-1}) = \emptyset$, then put $K_i = +\infty$, $i_{\max} = i - 1$ and stop. Otherwise,

$K_i := \inf \left\{ \frac{f(m) - f(m_{i-1})}{g(m_{i-1}) - g(m)} \text{ s.t. } m \in G(m_{i-1}) \right\}$   (4)

and $m_i := \min_{\preceq} F_i$ with $F_i := \arg\min_{m \in G(m_{i-1})} \left\{ \frac{f(m) - f(m_{i-1})}{g(m_{i-1}) - g(m)} \right\}$.

Proposition 1 (Correctness of Algorithm 2) If $\mathcal{M}_n$ is finite, Algorithm 2 terminates
and $i_{\max} \leq \mathrm{Card}(\mathcal{M}_n) - 1$. With the notation of Algorithm 2, let $\hat m(K)$ be the smallest
element of

$E(K) := \arg\min_{m \in \mathcal{M}_n} \{ f(m) + K g(m) \}$   with respect to $\preceq$.

Then, $(K_i)_{0 \leq i \leq i_{\max}+1}$ is increasing and $\forall i \in \{ 0, \ldots, i_{\max} - 1 \}$, $\forall K \in [K_i, K_{i+1})$, $\hat m(K) = m_i$.

It is proved in Section A.2. In the change-point detection framework, a similar result has
been proved by Lavielle (2005).



Proposition 1 also gives an upper bound on the computational complexity of Algorithm 2:
since the complexity of each step is $O(\mathrm{Card}\, \mathcal{M}_n)$, Algorithm 2 requires less than
$O(i_{\max} \mathrm{Card}\, \mathcal{M}_n) \leq O((\mathrm{Card}\, \mathcal{M}_n)^2)$ operations. In general, this upper bound is pessimistic
since $i_{\max} \ll \mathrm{Card}\, \mathcal{M}_n$.
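Algorithm 2 translates almost line by line into code. The following is a minimal sketch under stated assumptions: models are represented by the indices `0..len(f)-1`, already sorted so that `g` is non-decreasing (this plays the role of the total ordering), and the function name `penalty_path` is hypothetical. Each pass of the loop costs one scan of the collection, in line with the complexity bound above.

```python
import math


def penalty_path(f, g):
    """Sketch of Algorithm 2: jump locations K_i and models m_i of K -> m_hat(K).

    f[m] is the empirical risk of model m and g[m] its penalty shape.
    Returns (breakpoints, path) with breakpoints = [K_0, ..., K_{i_max + 1}],
    K_0 = 0, K_{i_max + 1} = +inf, and path = [m_0, ..., m_{i_max}], so that
    m_hat(K) = path[i] for K in [breakpoints[i], breakpoints[i + 1]).
    """
    models = range(len(f))
    # Init: m_0 minimizes f; the tuple key picks the smallest index on ties.
    current = min(models, key=lambda m: (f[m], m))
    breakpoints, path = [0.0], [current]
    while True:
        # Slopes over G(m_{i-1}): models with larger risk and smaller penalty shape.
        slopes = {
            m: (f[m] - f[current]) / (g[current] - g[m])
            for m in models
            if f[m] > f[current] and g[m] < g[current]
        }
        if not slopes:
            breakpoints.append(math.inf)
            return breakpoints, path
        k_next = min(slopes.values())
        # m_i: smallest model (for the ordering) among the minimizers of the slope.
        current = min(m for m, value in slopes.items() if value == k_next)
        breakpoints.append(k_next)
        path.append(current)
```

For example, with `f = [10.0, 4.0, 2.0, 1.0]` and `g = [1.0, 2.0, 3.0, 4.0]`, the path visits the models 3, 2, 1, 0 with jumps at K = 1, 2 and 6.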

3.3 Definition of $\hat K_{\min}$

Step 2 of Algorithm 1 estimates $K_{\min}$ such that $K_{\min} \mathrm{pen}_{\mathrm{shape}}$ is the minimal penalty. The
purpose of this subsection is to define properly $\hat K_{\min}$ as a function of $(\hat m(K))_{K > 0}$.

According to the slope heuristics described in Section 2.3, $K_{\min}$ corresponds to a
"complexity jump". If $K < K_{\min}$, $\hat m(K)$ has a large complexity, whereas if $K > K_{\min}$, $\hat m(K)$ has
a small or medium complexity. Therefore, the two following definitions of $\hat K_{\min}$ are natural.

Figure 1: $D_{\hat m(K)}$ as a function of $K$ for two different samples. Panel (b): two jumps, two
values for $\hat K_{\min}$. Data are simulated according to (1) with $n = 200$, $X_i \sim \mathcal{U}([0, 1])$,
$\epsilon_i \sim \mathcal{N}(0, 1)$, $s(x) = \sin(\pi x)$ and $\sigma \equiv 1$. The models $(S_m)_{m \in \mathcal{M}_n}$ are the sets of
piecewise constant functions on regular partitions of $[0, 1]$, with dimensions between 1 and
$n / \ln(n)$. The penalty shape is $\mathrm{pen}_{\mathrm{shape}}(m) = D_m$ and the dimension threshold is
$D_{\mathrm{thresh}} = 19 \approx n / (2 \ln(n))$. See experiment S1 by Arlot (2008a, Section 6.1) for details.

Let $D_{\mathrm{thresh}}$ be the largest "reasonably small" complexity, meaning the models with larger
complexities should not be selected. When $D_m$ is the dimension of $S_m$ as a vector space,
$D_{\mathrm{thresh}} \propto n / \ln(n)$ or $n / (\ln(n))^2$ are natural choices since the dimension of the oracle is
likely to be of order $n^\alpha$ for some $\alpha \in (0, 1)$. Then, define

$\hat K_{\min} := \inf \{ K > 0 \text{ s.t. } D_{\hat m(K)} \leq D_{\mathrm{thresh}} \}$.   (thresh)

With this definition, Algorithm 2 can be stopped as soon as the threshold is reached.
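Another idea is that $\hat K_{\min}$ should match with the largest complexity jump. Since the
trajectory jumps from $m_i$ to $m_{i+1}$ at $K = K_{i+1}$, this reads

$\hat K_{\min} := K_{i_{\mathrm{jump}}+1}$   with   $i_{\mathrm{jump}} = \arg\max_{i \in \{ 0, \ldots, i_{\max} - 1 \}} \{ D_{m_i} - D_{m_{i+1}} \}$.   (max jump)

In order to ensure that there is a clear jump in the sequence $(D_{m_i})_{i \geq 0}$, it may be useful to
add a few models of large complexity.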
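Both definitions are straightforward to evaluate once the path of jumps is known. The sketch below assumes the path is given as two lists: `breakpoints = [K_0, ..., K_{i_max+1}]` and `dims = [D_{m_0}, ..., D_{m_{i_max}}]`. It also assumes an indexing convention, namely that the drop from $m_i$ to $m_{i+1}$ is located at the breakpoint $K_{i+1}$ where the selected model changes. The function names are hypothetical.

```python
def k_min_threshold(breakpoints, dims, d_thresh):
    """(thresh): infimum of the K whose selected complexity is at most d_thresh."""
    for k_i, d_i in zip(breakpoints, dims):
        if d_i <= d_thresh:
            return k_i
    return float("inf")  # no model on the path is small enough


def k_min_max_jump(breakpoints, dims):
    """(max jump): breakpoint at which the complexity drops the most."""
    if len(dims) < 2:
        raise ValueError("the path has no jump")
    drops = [dims[i] - dims[i + 1] for i in range(len(dims) - 1)]
    # max() returns the first index on ties, i.e. the smallest K.
    i_jump = max(range(len(drops)), key=lambda i: drops[i])
    return breakpoints[i_jump + 1]
```

When there is one clear jump the two functions return the same value; they can differ when the complexity decreases in several comparable steps.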

As an illustration, we compared the two definitions above ("threshold" and "maximal
jump") on 1000 simulated samples. The exact simulation framework is described below
Figure 1. Three cases occurred:

1. There is one clear jump. Both definitions give the same value for $\hat K_{\min}$. This occurred
for about 85% of the samples; an example is given in Figure 1a.

2. There are several jumps corresponding to close values of $K$. Definitions (thresh) and
(max jump) give slightly different values for $\hat K_{\min}$, but the selected models $\hat m(2 \hat K_{\min})$
are equal. This occurred for about 8.5% of the samples.

3. There are several jumps corresponding to distant values of $K$. Definitions (thresh) and
(max jump) strongly disagree, giving different selected models $\hat m(2 \hat K_{\min})$ in the end.
This occurred for about 6.5% of the samples; an example is given in Figure 1b.

The only problematic case is the third one, in which an arbitrary choice has to be made
between definitions (thresh) and (max jump).



With the same simulated data, we have compared the prediction errors of the two
methods by estimating the constant $C_{\mathrm{or}}$ that would appear in some oracle inequality,

$C_{\mathrm{or}} := \frac{E[\ell(s, \hat s_{\hat m})]}{E[\inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \}]}$.

With definition (thresh), $C_{\mathrm{or}} \approx 1.88$; with definition (max jump), $C_{\mathrm{or}} \approx 2.01$. For both
methods, the standard error of the estimation is 0.04. As a comparison, Mallows' $C_p$ with
a classical estimator of the variance $\sigma^2$ has an estimated performance $C_{\mathrm{or}} \approx 1.93$ on the
same data.

The overall conclusion of this simulation experiment is that Algorithm 1 can be
competitive with Mallows' $C_p$ in a framework where Mallows' $C_p$ is known to be optimal.
Definition (thresh) for $\hat K_{\min}$ seems slightly more efficient than (max jump), but without
convincing evidence. Indeed, both definitions depend on some arbitrary choices: the value
of the threshold $D_{\mathrm{thresh}}$ in (thresh), the maximal complexity among the collection of models
$(S_m)_{m \in \mathcal{M}_n}$ in (max jump). When $n$ is small, say $n = 200$, choosing $D_{\mathrm{thresh}}$ is tricky since
$n / (2 \ln(n))$ and $\sqrt{n}$ are quite close. Then, the difference between (thresh) and (max jump)
is likely to come mainly from the particular choice $D_{\mathrm{thresh}} = 19$ rather than from basic differences
between the two definitions.

In order to estimate $K_{\min}$ as automatically as possible, we suggest combining the two
definitions: when the selected models $\hat m(2 \hat K_{\min})$ differ, send a warning to the final user
advising them to look at the curve $K \mapsto D_{\hat m(K)}$ themselves; otherwise, remain confident in the
automatic choice of $\hat m(2 \hat K_{\min})$.
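This combined rule can be sketched as follows. The sketch assumes that the path of jumps is available as `breakpoints = [K_0, ..., K_{i_max+1}]` and `path = [m_0, ..., m_{i_max}]`, and that the two candidate values of the minimal constant have already been computed. The function names, the warning text and the choice of returning both models together with an agreement flag are illustrative assumptions; the text does not prescribe which model to keep when the two definitions disagree.

```python
import bisect
import warnings


def selected_model(breakpoints, path, k):
    """Return m_hat(k) = m_i for the index i such that K_i <= k < K_{i+1}."""
    i = bisect.bisect_right(breakpoints, k) - 1
    return path[min(i, len(path) - 1)]  # k = +inf falls in the last segment


def final_choice(breakpoints, path, k_min_thresh, k_min_jump):
    """Select with twice each candidate constant and warn if the models differ."""
    m_thresh = selected_model(breakpoints, path, 2 * k_min_thresh)
    m_jump = selected_model(breakpoints, path, 2 * k_min_jump)
    agree = m_thresh == m_jump
    if not agree:
        warnings.warn(
            "threshold and maximal-jump definitions select different models; "
            "inspect the curve of the selected complexity against K"
        )
    return m_thresh, m_jump, agree
```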

3.4 Penalty shape 

To use Algorithm 1 in practice, it is necessary to know a priori, or at least to estimate,
the optimal shape $\mathrm{pen}_{\mathrm{shape}}$ of the penalty. Let us explain how this can be achieved in
different frameworks.

The first example that comes to mind is $\mathrm{pen}_{\mathrm{shape}}(m) = D_m$. It is valid for homoscedastic
least-squares regression on linear models, as shown by several papers mentioned in Section 1.
Indeed, when $\mathrm{Card}(\mathcal{M}_n)$ is smaller than some power of $n$, Mallows' $C_p$ penalty (defined
by $\mathrm{pen}(m) = 2 E[\sigma^2(X)] n^{-1} D_m$) is well known to be asymptotically optimal. For larger
collections $\mathcal{M}_n$, more elaborate results (Birgé and Massart, 2001, 2007) have shown that a
penalty proportional to $\ln(n) E[\sigma^2(X)] n^{-1} D_m$ and depending on the size of $\mathcal{M}_n$ is
asymptotically optimal.

Algorithm 1 then provides an alternative to plugging an estimator of $E[\sigma^2(X)]$ into
the above penalties. Let us detail two main advantages of our approach. First, we avoid the
difficult task of estimating $E[\sigma^2(X)]$ without knowing in advance some model to which the
true regression function belongs. Algorithm 1 provides a model-free estimation of the factor
multiplying the penalty. Second, the estimator $\hat\sigma^2$ of $E[\sigma^2(X)]$ with the smallest quadratic
risk is certainly far from being the optimal one for model selection. For instance,
underestimating the multiplicative factor is well known to lead to poor performance, whereas
overestimating the multiplicative factor does not increase the prediction error much in
general. Then, a good estimator of $E[\sigma^2(X)]$ for model selection should overestimate it with a
probability larger than 1/2. Algorithm 1 satisfies this property automatically because $\hat K_{\min}$
is chosen so that the selected model cannot be too large.

In short, Algorithm 1 with $\mathrm{pen}_{\mathrm{shape}}(m) = D_m$ is quite different from a simple plug-in
version of Mallows' $C_p$. It leads to a really data-dependent penalty, which may perform
better in practice than the best deterministic penalty $K^\star D_m$.

In a more general framework, Algorithm 1 allows choosing a different shape of penalty
$\mathrm{pen}_{\mathrm{shape}}$. For instance, in the heteroscedastic least-squares regression framework of
Section 2.1, the optimal penalty is no longer proportional to the dimension $D_m$ of the models.
This can be shown from computations made by Arlot (2008c, Proposition 1) when $S_m$ is
assumed to be the vector space of piecewise constant functions on a partition
$(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$:

$E[\mathrm{pen}_{\mathrm{id}}(m)] = E[(P - P_n)\gamma(\hat s_m)] \approx \frac{2}{n} \sum_{\lambda \in \Lambda_m} E[\sigma(X)^2 \mid X \in I_\lambda]$.   (5)

An exact result has been proved by Arlot (2008c, Proposition 1). Moreover, Arlot (2008a)
gave an example of model selection problem in which no penalty proportional to $D_m$ can
be asymptotically optimal.
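As a sanity check on formula (5), read here as $\frac{2}{n}$ times the sum over the cells of the conditional noise variance, one can verify that it reduces to Mallows' $C_p$ penalty in the homoscedastic case. The computation below assumes a constant noise level $\sigma(x) \equiv \sigma$ and uses that the number of cells of the partition equals the dimension $D_m$ of the model.

```latex
% Homoscedastic case: \sigma(x) = \sigma for every x, so each conditional
% expectation in (5) equals \sigma^2 and the sum has Card(\Lambda_m) = D_m terms:
\frac{2}{n} \sum_{\lambda \in \Lambda_m} E\left[ \sigma(X)^2 \mid X \in I_\lambda \right]
  = \frac{2}{n} \sum_{\lambda \in \Lambda_m} \sigma^2
  = \frac{2 \sigma^2 D_m}{n},
% which is the Mallows' C_p penalty 2 E[\sigma^2(X)] n^{-1} D_m. When \sigma
% varies with x, the sum depends on the partition itself and not only on D_m.
```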

A first way to estimate the shape of the penalty is simply to use (5) to compute $\mathrm{pen}_{\mathrm{shape}}$,
when both the distribution of $X$ and the shape of the noise level $\sigma$ are known. In practice,
one seldom has such prior knowledge.

We suggest in this situation to use resampling penalties (Efron, 1983; Arlot, 2008c),
or $V$-fold penalties (Arlot, 2008b) which have much smaller computational costs. Up to
a multiplicative factor (automatically estimated by Algorithm 1), these penalties should
estimate correctly $E[\mathrm{pen}_{\mathrm{id}}(m)]$ in any framework. In particular, resampling and $V$-fold
penalties are asymptotically optimal in the heteroscedastic least-squares regression
framework (Arlot, 2008b,c).



3.5 The general prediction framework 

In Section 2 and in the definition of Algorithm 1, we have restricted ourselves to the
least-squares regression framework. Actually, this is not necessary at all to make Algorithm 1
well-defined, so that it can naturally be extended to the general prediction framework. More
precisely, the $(X_i, Y_i)$ can be assumed to belong to $\mathcal{X} \times \mathcal{Y}$ for some general $\mathcal{Y}$, and
$\gamma : S \times (\mathcal{X} \times \mathcal{Y}) \to [0, +\infty)$ any contrast function. In particular, $\mathcal{Y} = \{0, 1\}$ leads to the
binary classification problem, for which a natural contrast function is the 0-1 loss
$\gamma(t; (x, y)) = 1_{t(x) \neq y}$. In this case, the shape of the penalty $\mathrm{pen}_{\mathrm{shape}}$ can for instance be
estimated with the global or local Rademacher complexities mentioned in Section 1.

However, a natural question is whether the slope heuristics of Section 2.3, upon which
Algorithm 1 relies, can be extended to the general framework. Several concentration results
used to prove the validity of the slope heuristics in the least-squares regression framework in
this article are valid in a general setting including binary classification. Even if the factor 2
coming from the closeness of $E[p_1]$ and $E[p_2]$ (see Section 2.3) may not be universally valid,
we conjecture that Algorithm 1 can be used in other settings than least-squares regression.
Moreover, as already mentioned at the end of Section 1, empirical studies have shown that
Algorithm 1 can be successfully applied to several problems, with different shapes for the
penalty. To our knowledge, giving a formal proof of this fact remains an interesting open
problem.

4. Theoretical results 

Algorithm 1 mainly relies on the "slope heuristics", developed in Section 2.3. The goal of
this section is to provide a theoretical justification of this heuristics.

It is split into two main results. First, Theorem 2 provides lower bounds on $D_{\hat m}$ and the
risk of $\hat s_{\hat m}$ when the penalty is smaller than $\mathrm{pen}_{\min}(m) := E[p_2(m)]$. Second, Theorem 3 is
an oracle inequality with leading constant almost one when $\mathrm{pen}(m) \approx 2 E[p_2(m)]$, relying
on (3) and the comparison $p_1 \approx p_2$.

In order to prove both theorems, two probabilistic results are necessary. First, $p_1$, $p_2$ and
$\delta$ concentrate around their expectations; for $p_2$ and $\delta$, it is proved in a general framework in
Appendix A.6. Second, $E[p_1(m)] \approx E[p_2(m)]$ for every $m \in \mathcal{M}_n$. The latter point is quite
hard to prove in general, so that we must make an assumption on the models. Therefore,
in this section, we restrict ourselves to the regressogram case, assuming that for every
$m \in \mathcal{M}_n$, $S_m$ is the set of piecewise constant functions on some fixed partition $(I_\lambda)_{\lambda \in \Lambda_m}$
of $\mathcal{X}$. This framework is described precisely in the next subsection. Although we do not
consider regressograms as a final goal, the theoretical results proved for regressograms help
to understand better how to use Algorithm 1 in practice.

4.1 Regressograms 

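Let $S_m$ be the set of piecewise constant functions on some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$.
The empirical risk minimizer $\hat s_m$ on $S_m$ is called a regressogram. $S_m$ is a vector space of
dimension $D_m = \mathrm{Card}(\Lambda_m)$, spanned by the family $(\mathbf{1}_{I_\lambda})_{\lambda \in \Lambda_m}$. Since this basis is orthogonal
in $L^2(\mu)$ for any probability measure $\mu$ on $\mathcal{X}$, computations are quite easy. In particular,
we have:

$$ s_m = \sum_{\lambda \in \Lambda_m} \beta_\lambda \mathbf{1}_{I_\lambda} \quad \text{and} \quad \hat s_m = \sum_{\lambda \in \Lambda_m} \hat\beta_\lambda \mathbf{1}_{I_\lambda} , $$

where

$$ \beta_\lambda := \mathbb{E}_P[Y \mid X \in I_\lambda] , \qquad \hat\beta_\lambda := \frac{1}{n \hat p_\lambda} \sum_{X_i \in I_\lambda} Y_i , \qquad \hat p_\lambda := P_n(X \in I_\lambda) . $$

Note that $\hat s_m$ is uniquely defined if and only if each $I_\lambda$ contains at least one of the $X_i$.
Otherwise, $\hat s_m$ is not uniquely defined and we consider that the model $m$ cannot be chosen.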
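
As an illustration, the regressogram can be computed in a few lines. The following Python sketch is only an example under simplifying assumptions that are ours and not part of the analysis: $\mathcal{X} = [0, 1]$, a regular partition into `n_bins` intervals, and the hypothetical helper names `regressogram` and `empirical_risk`.

```python
import numpy as np

def _cells(x, n_bins):
    """Index of the cell of the regular partition of [0, 1] containing each x_i."""
    x = np.asarray(x, dtype=float)
    # Clip so that x == 1.0 falls in the last cell.
    return np.clip((x * n_bins).astype(int), 0, n_bins - 1)

def regressogram(x, y, n_bins):
    """Return (beta_hat, p_hat): per-cell means of y and empirical cell frequencies.

    beta_hat is NaN on empty cells, where the estimator is not uniquely defined.
    """
    y = np.asarray(y, dtype=float)
    cells = _cells(x, n_bins)
    counts = np.bincount(cells, minlength=n_bins)
    sums = np.bincount(cells, weights=y, minlength=n_bins)
    p_hat = counts / len(y)
    beta_hat = np.full(n_bins, np.nan)
    nonempty = counts > 0
    beta_hat[nonempty] = sums[nonempty] / counts[nonempty]
    return beta_hat, p_hat

def empirical_risk(x, y, n_bins):
    """Least-squares empirical risk of the regressogram, or inf if a cell is empty."""
    beta_hat, p_hat = regressogram(x, y, n_bins)
    if np.any(p_hat == 0):
        return np.inf  # such a model cannot be chosen
    residuals = np.asarray(y, dtype=float) - beta_hat[_cells(x, n_bins)]
    return float(np.mean(residuals ** 2))
```

Returning an infinite empirical risk when some cell is empty is one simple way to encode that the corresponding model cannot be chosen.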
4.2 Main assumptions

In this section, we make the following assumptions. First, each model $S_m$ is a set of piecewise
constant functions on some fixed partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$. Second, the family $(S_m)_{m \in \mathcal{M}_n}$
satisfies:

(P1) Polynomial complexity of $\mathcal{M}_n$: $\mathrm{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.
(P2) Richness of $\mathcal{M}_n$: $\exists m_0 \in \mathcal{M}_n$ s.t. $D_{m_0} \in [\sqrt{n}, c_{\mathrm{rich}} \sqrt{n}]$.

Assumption (P1) is quite classical for proving the asymptotic optimality of a model selection
procedure; it is for instance implicitly assumed by Li (1987) in the homoscedastic fixed-design
case. Assumption (P2) is merely technical and can be changed if necessary; it only
ensures that $(S_m)_{m \in \mathcal{M}_n}$ does not contain only models which are either too small or too
large.

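For any penalty function $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}^+$, we define the following model selection
procedure:

$$ \hat m \in \arg\min_{m \in \mathcal{M}_n ,\ \min_{\lambda \in \Lambda_m} \{\hat p_\lambda\} > 0} \{ P_n \gamma(\hat s_m) + \mathrm{pen}(m) \} . \qquad (6) $$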
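
Procedure (6) amounts to an arg min restricted to the models whose cells are all non-empty. A minimal Python sketch follows; the function name `select_model` and its array-based interface (one entry per model) are hypothetical and chosen only for illustration.

```python
import numpy as np

def select_model(emp_risk, pen, min_cell_freq):
    """Index of the model minimizing emp_risk + pen among admissible models.

    emp_risk[m]      : empirical risk of the estimator built on model m
    pen[m]           : penalty of model m
    min_cell_freq[m] : smallest empirical cell frequency of model m
    A model is admissible when min_cell_freq[m] > 0.
    """
    emp_risk = np.asarray(emp_risk, dtype=float)
    pen = np.asarray(pen, dtype=float)
    admissible = np.asarray(min_cell_freq, dtype=float) > 0
    if not admissible.any():
        raise ValueError("no admissible model")
    # Non-admissible models get an infinite criterion so they are never selected.
    crit = np.where(admissible, emp_risk + pen, np.inf)
    return int(np.argmin(crit))
```

With `np.argmin`, ties are broken in favor of the first model in the list; any other deterministic rule could be substituted.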

Moreover, the data $(X_i, Y_i)_{1 \le i \le n}$ are assumed to be i.i.d. and to satisfy:

(Ab) The data are bounded: $\|Y_i\|_\infty \le A < \infty$.

(An) Uniform lower-bound on the noise level: $\sigma(X_i) \ge \sigma_{\min} > 0$ a.s.

($\mathbf{Ap}_u$) The bias decreases as a power of $D_m$: there exist some $\beta_+, C_+ > 0$ such that

$$ \ell(s, s_m) \le C_+ D_m^{-\beta_+} . $$

($\mathbf{Ar}_\ell^X$) Lower regularity of the partitions for $\mathcal{L}(X)$: $D_m \min_{\lambda \in \Lambda_m} \{ P(X \in I_\lambda) \} \ge c_{r,\ell}^X$.

Further comments are made in Sections 4.3 and 4.4 about these assumptions, in particular
about their possible weakening.

4.3 Minimal penalties 

Our first result concerns the existence of a minimal penalty. In this subsection, (P2) is
replaced by the following stronger assumption:

(P2+) $\exists c_0, c_{\mathrm{rich}} > 0$ s.t. $\forall l \in [\sqrt{n}, c_0 n / (c_{\mathrm{rich}} \ln(n))]$, $\exists m \in \mathcal{M}_n$ s.t. $D_m \in [l, c_{\mathrm{rich}} l]$.

The reason why (P2) is not sufficient to prove Theorem 2 below is that at least one model
of dimension of order $n / \ln(n)$ should belong to the family $(S_m)_{m \in \mathcal{M}_n}$; otherwise, it may
not be possible to prove that such models are selected by penalization procedures beyond
the minimal penalty.

Theorem 2 Suppose all the assumptions of Section 4.2 are satisfied. Let $K \in [0; 1)$, $L > 0$,
and assume that an event of probability at least $1 - L n^{-2}$ exists on which

$$ \forall m \in \mathcal{M}_n , \quad 0 \le \mathrm{pen}(m) \le K \, \mathbb{E}[P_n(\gamma(s_m) - \gamma(\hat s_m))] . \qquad (7) $$

Then, there exist two positive constants $K_1$, $K_2$ such that, with probability at least $1 - K_1 n^{-2}$,

$$ D_{\hat m} \ge K_2 n \ln(n)^{-1} , $$

where $\hat m$ is defined by (6). On the same event,

$$ \ell(s, \hat s_{\hat m}) \ge \ln(n) \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} . \qquad (8) $$

The constants $K_1$ and $K_2$ may depend on $K$, $L$ and constants in (P1), (P2+), (Ab),
(An), ($\mathbf{Ap}_u$) and ($\mathbf{Ar}_\ell^X$), but do not depend on $n$.

This theorem thus validates the first part of the heuristics of Section 2.3, proving that
a minimal amount of penalization is required; when the penalty is smaller, the selected
dimension $D_{\hat m}$ and the quadratic risk of the final estimator $\ell(s, \hat s_{\hat m})$ blow up. This coupling
is quite interesting, since the dimension $D_{\hat m}$ is known in practice, contrary to $\ell(s, \hat s_{\hat m})$. It
is then possible to detect from the data whether the penalty is too small, as proposed in
Algorithm 1.

The main interest of this result is its combination with Theorem 3 below. Nevertheless,
Theorem 2 is also interesting by itself for understanding the theoretical properties of
penalization procedures. Indeed, it generalizes the results of Birgé and Massart (2007) on
the existence of minimal penalties to heteroscedastic regression with a random design, even
if we have to restrict to regressograms. Moreover, we have a general formulation for the
minimal penalty

$$ \mathrm{pen}_{\min}(m) := \mathbb{E}[P_n(\gamma(s_m) - \gamma(\hat s_m))] = \mathbb{E}[p_2(m)] , $$

which can be used in frameworks where it is not proportional to the dimension
$D_m$ of the models (see Section 3.4 and references therein).

In addition, assumptions (Ab) and (An) on the data are much weaker than the Gaussian
homoscedastic assumption. They are also much more realistic, and moreover can be strongly
relaxed. Roughly speaking, boundedness of data can be replaced by conditions on moments
of the noise, and the uniform lower bound $\sigma_{\min}$ is no longer necessary when $\sigma$ satisfies some
mild regularity assumptions. We refer to Arlot (2008a, Section 4.3) for detailed statements
of these assumptions, and explanations on how to adapt proofs to these situations.

Finally, let us comment on conditions ($\mathbf{Ap}_u$) and ($\mathbf{Ar}_\ell^X$). The upper bound ($\mathbf{Ap}_u$) on
the bias occurs in the most reasonable situations, for instance when $\mathcal{X} \subset \mathbb{R}^k$ is bounded, the
partition $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular and the regression function $s$ is $\alpha$-Hölderian for some $\alpha > 0$
($\beta_+$ depending on $\alpha$ and $k$). It ensures that medium and large models have a significantly
smaller bias than smaller ones; otherwise, the selected dimension would be allowed to be
too small with significant probability. On the other hand, ($\mathbf{Ar}_\ell^X$) is satisfied at least for
"almost regular" partitions $(I_\lambda)_{\lambda \in \Lambda_m}$, when $X$ has a lower bounded density w.r.t. the
Lebesgue measure on $\mathcal{X} \subset \mathbb{R}^k$.

Theorem 2 is stated with a general formulation of ($\mathbf{Ap}_u$) and ($\mathbf{Ar}_\ell^X$), instead of assuming
for instance that $s$ is $\alpha$-Hölderian and $X$ has a lower bounded density w.r.t. the Lebesgue measure, in order to
point out the generality of the "minimal penalization" phenomenon. It occurs as soon as the
models are not too pathological. In particular, we do not make any assumption on the
distribution of $X$ itself, but only that the models are not too badly chosen according to this
distribution. Such a condition can be checked in practice if some prior knowledge on $\mathcal{L}(X)$
is available; if part of the data are unlabeled (a usual case), classical density estimation
procedures can be applied for estimating $\mathcal{L}(X)$ from unlabeled data (Devroye and Lugosi,
2001).

4.4 Optimal penalties 

Algorithm 1 relies on a link between the minimal penalty pointed out by Theorem 2 and
some optimal penalty. The following result is a formal proof of this link in the framework
we consider: penalties close to twice the minimal penalty satisfy an oracle inequality with
leading constant approximately equal to one.

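Theorem 3 Suppose all the assumptions of Section 4.2 are satisfied together with

(Ap) The bias decreases like a power of $D_m$: there exist $\beta_- \ge \beta_+ > 0$ and $C_+, C_- > 0$ such
that

$$ C_- D_m^{-\beta_-} \le \ell(s, s_m) \le C_+ D_m^{-\beta_+} . $$

Let $\delta \in (0, 1)$, $L > 0$, and assume that an event of probability at least $1 - L n^{-2}$ exists on
which for every $m \in \mathcal{M}_n$,

$$ (2 - \delta) \, \mathbb{E}[P_n(\gamma(s_m) - \gamma(\hat s_m))] \le \mathrm{pen}(m) \le (2 + \delta) \, \mathbb{E}[P_n(\gamma(s_m) - \gamma(\hat s_m))] . \qquad (9) $$

Then, for every $0 < \eta < \min\{\beta_+; 1\}/2$, there exist a constant $K_3$ and a sequence $\varepsilon_n$
tending to zero at infinity such that, with probability at least $1 - K_3 n^{-2}$,

$$ D_{\hat m} \le n^{1 - \eta} \quad \text{and} \quad \ell(s, \hat s_{\hat m}) \le \left( \frac{1 + \delta}{1 - \delta} + \varepsilon_n \right) \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} , \qquad (10) $$

where $\hat m$ is defined by (6). Moreover, we have the oracle inequality

$$ \mathbb{E}[\ell(s, \hat s_{\hat m})] \le \left( \frac{1 + \delta}{1 - \delta} + \varepsilon_n \right) \mathbb{E}\left[ \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} \right] + \frac{A^2 K_3}{n^2} . $$

The constant $K_3$ may depend on $L$, $\delta$, $\eta$ and the constants in (P1), (P2), (Ab), (An),
(Ap) and ($\mathbf{Ar}_\ell^X$), but not on $n$. The term $\varepsilon_n$ is smaller than $\ln(n)^{-1/5}$; it can be made
smaller than $n^{-\delta}$ for any $\delta \in (0; \delta_0(\beta_-, \beta_+))$ at the price of enlarging $K_3$.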



This theorem shows that twice the minimal penalty $\mathrm{pen}_{\min}$ pointed out by Theorem 2
satisfies an oracle inequality with leading constant almost one. In other words, the slope
heuristics of Section 2.3 is valid. The consequences of the combination of Theorems 2 and 3
are detailed in Section 4.5.

The oracle inequality (10) remains valid when the penalty is only close to twice the
minimal one. In particular, the shape of the penalty can be estimated by resampling as
suggested in Section 3.4.
Actually, Theorem 3 above is a corollary of a more general result stated in Appendix A.3,
Theorem 5. If

$$ \mathrm{pen}(m) \approx K \, \mathbb{E}[P_n(\gamma(s_m) - \gamma(\hat s_m))] \qquad (11) $$

instead of (9), under the same assumptions, an oracle inequality with leading constant
$C(K) + \varepsilon_n$ instead of $1 + \varepsilon_n$ holds with large probability. The constant $C(K)$ is equal to
$(K - 1)^{-1}$ when $K \in (1, 2]$ and to $K - 1$ when $K > 2$. Therefore, for every $K > 1$,
the penalty defined by (11) is efficient up to a multiplicative constant. This result is new
in the heteroscedastic framework.
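
To make the dependence of the leading constant on $K$ concrete, the two regimes of $C(K)$ can be written as a small Python function. This is only a transcription of the formula stated in the text; the name `leading_constant` is ours.

```python
def leading_constant(k):
    """Leading constant C(K) for a penalty close to K times the minimal one.

    C(K) = 1 / (K - 1) for K in (1, 2], and C(K) = K - 1 for K > 2.
    """
    if k <= 1:
        raise ValueError("C(K) is only defined for K > 1")
    return 1.0 / (k - 1.0) if k <= 2 else k - 1.0
```

The constant equals one at $K = 2$ and grows on both sides: under-penalizing toward $K = 1$ makes it blow up, while over-penalizing increases it only linearly.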

Let us comment on the additional assumption (Ap), that is the lower bound on the bias.
Assuming $\ell(s, s_m) > 0$ for every $m \in \mathcal{M}_n$ is classical for proving the asymptotic optimality
of Mallows' $C_p$ (Shibata, 1981; Li, 1987; Birgé and Massart, 2007). (Ap) has been made
by Stone (1983) and Burman in the density estimation framework, for the same
technical reasons as ours. Assumption (Ap) is satisfied in several frameworks, such as
the following: $(I_\lambda)_{\lambda \in \Lambda_m}$ is "regular", $X$ has a lower-bounded density w.r.t. the Lebesgue
measure on $\mathcal{X} \subset \mathbb{R}^k$, and $s$ is non-constant and $\alpha$-hölderian (w.r.t. $\|\cdot\|_\infty$), with

$$ \beta_- = k^{-1} + \alpha^{-1} - (k - 1) k^{-1} \alpha^{-1} \quad \text{and} \quad \beta_+ = 2 \alpha k^{-1} . $$

We refer to Arlot (2007, Section 8.10) for a complete proof.

When the lower bound in (Ap) is no longer assumed, (10) holds with two modifications
in its right-hand side (for details, see Arlot, 2008a, Remark 9): the inf is restricted to
models of dimension larger than $\ln(n)^{\gamma_1}$, and there is a remainder term $\ln(n)^{\gamma_2} n^{-1}$, where
$\gamma_1, \gamma_2 > 0$ are numerical constants. This is equivalent to (10), unless there is a model of
small dimension with a small bias. The lower bound in (Ap) ensures that it cannot happen.
Note that if there is a small model close to $s$, it is hopeless to obtain an oracle inequality with
a penalty which estimates $\mathrm{pen}_{\mathrm{id}}$, simply because deviations of $\mathrm{pen}_{\mathrm{id}}$ around its expectation
would be much larger than the excess loss of the oracle. In such a situation, BIC-like
methods are more appropriate; for instance, Csiszár (2002) and Csiszár and Shields (2000)
showed that BIC penalties are minimal penalties for estimating the order of a Markov chain.



4.5 Main theoretical and practical consequences 

The slope heuristics and the correctness of Algorithm 1 follow from the combination of
Theorems 2 and 3.



4.5.1 Optimal and minimal penalties 

For the sake of simplicity, let us consider the penalty $K \mathbb{E}[p_2(m)]$ with any $K > 0$; any
penalty close to this one satisfies similar properties. At first reading, one can think of the
homoscedastic case where $\mathbb{E}[p_2(m)] \approx \sigma^2 D_m n^{-1}$; one of the novelties of our results is that
the general picture is quite similar.
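
The homoscedastic approximation $\mathbb{E}[p_2(m)] \approx \sigma^2 D_m n^{-1}$ can be checked numerically. The Monte Carlo sketch below is an illustration only, under assumptions that are ours: $s = 0$ (so that $\beta_\lambda = 0$), Gaussian noise of constant level $\sigma$, $X$ uniform on $[0, 1]$ and a regular partition into `n_bins` cells. The name `mean_p2` is hypothetical.

```python
import numpy as np

def mean_p2(n, n_bins, sigma, n_rep, seed=0):
    """Monte Carlo estimate of E[p_2(m)] for a regular regressogram model.

    With s = 0, p_2(m) = sum over cells of p_hat * beta_hat^2,
    which equals sum(sums^2 / counts) / n over non-empty cells.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rep):
        x = rng.uniform(size=n)
        y = sigma * rng.standard_normal(n)  # pure noise since s = 0
        cells = np.minimum((x * n_bins).astype(int), n_bins - 1)
        counts = np.bincount(cells, minlength=n_bins)
        sums = np.bincount(cells, weights=y, minlength=n_bins)
        nonempty = counts > 0
        total += np.sum(sums[nonempty] ** 2 / counts[nonempty]) / n
    return total / n_rep
```

For instance, `mean_p2(200, 10, 1.0, 2000)` should be close to $\sigma^2 D_m / n = 0.05$, up to Monte Carlo error.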

According to Theorem 3, the penalization procedure associated with $K \mathbb{E}[p_2(m)]$ satisfies
an oracle inequality with leading constant $C_n(K)$ as soon as $K > 1$, and $C_n(2) \approx 1$.
Moreover, results proved by Arlot (2008b) imply that $C_n(K) \ge C(K) > 1$ as soon as $K$ is
not close to 2. Therefore, $K = 2$ is the optimal multiplying factor in front of $\mathbb{E}[p_2(m)]$.

When $K < 1$, Theorem 2 shows that no oracle inequality can hold with leading constant
$C_n(K) < \ln(n)$. Since $C_n(K) \le (K - 1)^{-1} < \ln(n)$ as soon as $K > 1 + \ln(n)^{-1}$, $K = 1$ is the
minimal multiplying factor in front of $\mathbb{E}[p_2(m)]$. More generally, $\mathrm{pen}_{\min}(m) := \mathbb{E}[p_2(m)]$
is proved to be a minimal penalty.

In short, Theorems 2 and 3 prove the slope heuristics described in Section 2.3:

"optimal" penalty $\approx$ 2 $\times$ "minimal" penalty.

Birgé and Massart (2007) have proved the validity of the slope heuristics in the Gaussian
homoscedastic framework. This paper extends their result to a non-Gaussian and
heteroscedastic setting.

4.5.2 Dimension jump 

In addition, Theorems 2 and 3 prove the existence of a crucial phenomenon: there exists
a "dimension jump" (a complexity jump in the general framework) around the minimal
penalty. Let us consider again the penalty $K \mathbb{E}[p_2(m)]$. As in Algorithm 1, let us define

$$ \hat m(K) \in \arg\min_{m \in \mathcal{M}_n} \{ P_n \gamma(\hat s_m) + K \mathbb{E}[p_2(m)] \} . $$

A careful look at the proofs of Theorems 2 and 3 shows that there exist constants $K_4, K_5 > 0$
and an event of probability $1 - K_4 n^{-2}$ on which

$$ \forall\, 0 < K < 1 - \frac{1}{\ln(n)} , \quad D_{\hat m(K)} \ge \frac{K_5 n}{(\ln(n))^2} \quad \text{and} \quad \forall K > 1 + \frac{1}{\ln(n)} , \quad D_{\hat m(K)} \le n^{1 - \eta} . \qquad (12) $$

Therefore, the dimension $D_{\hat m(K)}$ of the selected model jumps around the minimal value
$K = 1$, from values of order $n (\ln(n))^{-2}$ to $n^{1 - \eta}$.

Let us now explain why Algorithm 1 is correct, assuming that $\mathrm{pen}_{\mathrm{shape}}(m)$ is close
to $\mathbb{E}[p_2(m)]$. With definition (thresh) of $\hat K_{\min}$ and a threshold $D_{\mathrm{thresh}} \propto n (\ln(n))^{-3}$, (12)
ensures that

$$ 1 - \frac{1}{\ln(n)} \le \hat K_{\min} \le 1 + \frac{1}{\ln(n)} $$

with a large probability. Then, according to Theorem 3, the output of Algorithm 1 satisfies
an oracle inequality with leading constant $C_n$ tending to one as $n$ tends to infinity.
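
The calibration step just described can be sketched in Python. This is a simplified illustration and not a definitive implementation of Algorithm 1: it scans a finite grid of constants `k_grid` instead of computing the exact path $K \mapsto \hat m(K)$, and the function name `dimension_jump` and its arguments are hypothetical.

```python
import numpy as np

def dimension_jump(emp_risk, pen_shape, dims, d_thresh, k_grid):
    """Estimate K_min by a dimension threshold, then select with 2 * K_min.

    emp_risk[m], pen_shape[m], dims[m] describe model m.
    k_min is the smallest K on the grid whose selected dimension is <= d_thresh.
    Returns (k_min, index of the finally selected model).
    """
    emp_risk = np.asarray(emp_risk, dtype=float)
    pen_shape = np.asarray(pen_shape, dtype=float)
    dims = np.asarray(dims, dtype=float)
    k_min = None
    for k in np.sort(np.asarray(k_grid, dtype=float)):
        m_k = int(np.argmin(emp_risk + k * pen_shape))
        if dims[m_k] <= d_thresh:
            k_min = float(k)
            break
    if k_min is None:
        raise ValueError("no K on the grid selects a dimension below d_thresh")
    # "optimal" penalty = 2 x "minimal" penalty
    m_hat = int(np.argmin(emp_risk + 2.0 * k_min * pen_shape))
    return k_min, m_hat
```

On an idealized example where the empirical risk behaves like $1/D_m - D_m/n$ and the penalty shape is $D_m/n$, the selected dimension stays maximal for constants up to $1$ and drops sharply just above $1$, so the estimated constant falls next to $K = 1$.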

4.6 Comparison with data-splitting methods 

Tuning parameters are often chosen by cross-validation or by another data-splitting method,
which suffer from some drawbacks compared to Algorithm 1.

First, V-fold cross-validation, leave-p-out and repeated learning-testing methods require
a larger computation time. Indeed, they need to perform the empirical risk minimization
process for each model several times, whereas Algorithm 1 only needs to perform it once.

Second, V-fold cross-validation is asymptotically suboptimal when $V$ is fixed, as shown
by Arlot (2008b, Theorem 1). The same suboptimality result is valid for the hold-out, when
the size of the training set is not asymptotically equivalent to the sample size $n$. On the
contrary, Theorems 2 and 3 prove that Algorithm 1 is asymptotically optimal in a framework
including the one used by Arlot (2008b, Theorem 1) for proving the suboptimality of V-fold
cross-validation. Hence, the quadratic risk of Algorithm 1 should be smaller, within a factor
$K > 1$.

Third, hold-out with a training set of size $n_t \sim n$, for instance $n_t = n - \sqrt{n}$ or $n_t =
n (1 - \ln(n)^{-1})$, is known to be unstable. The final output $\hat m$ strongly depends on the
choice of a particular split of the data. According to the simulation study of Section 3.3,
Algorithm 1 is far more stable.

To conclude, compared to data-splitting methods, Algorithm 1 is either faster to compute,
more efficient in terms of quadratic risk, or more stable. Hence, Algorithm 1 should
be preferred each time it can be used. Another approach is to use aggregation techniques,
instead of selecting one model. As shown by several results (see for instance Tsybakov,
2004; Lecué, 2007), aggregating estimators built upon a training sample of size $n_t \sim n$ can
have an optimal quadratic risk. Moreover, aggregation requires approximately the same
computation time as Algorithm 1, and is much more stable than the hold-out. Hence, it
can be an alternative to model selection with Algorithm 1.

5. Conclusion 

This paper provides mathematical evidence that the method introduced by Birgé and Massart
(2007) for designing data-driven penalties remains efficient in a non-Gaussian framework.
The purpose of this conclusion is to relate the slope heuristics developed in Section 2 to the
well-known Mallows' $C_p$ and Akaike's criteria and to the unbiased estimation of the risk
principle.

Let us come back to Gaussian model selection in order to explain how to guess
what is the right penalty from the data themselves. Let $\gamma_n$ be some empirical criterion
(for instance the least-squares criterion as in this paper, or the log-likelihood criterion),
$(S_m)_{m \in \mathcal{M}_n}$ be a collection of models and for every $m \in \mathcal{M}_n$, $s_m$ be some minimizer of
$t \mapsto \mathbb{E}[\gamma_n(t)]$ over $S_m$ (assuming that such a point exists). Minimizing some penalized
criterion

$$ \gamma_n(\hat s_m) + \mathrm{pen}(m) $$

over $\mathcal{M}_n$ amounts to minimizing

$$ \hat b_m - \hat v_m + \mathrm{pen}(m) , $$

where $\forall m \in \mathcal{M}_n$, $\hat b_m = \gamma_n(s_m) - \gamma_n(s)$ and $\hat v_m = \gamma_n(s_m) - \gamma_n(\hat s_m)$.

The point is that $\hat b_m$ is an unbiased estimator of the bias term $\ell(s, s_m)$. Having concentration
arguments in mind, minimizing $\hat b_m - \hat v_m + \mathrm{pen}(m)$ can be conjectured approximately
equivalent to minimizing

$$ \ell(s, s_m) - \mathbb{E}[\hat v_m] + \mathrm{pen}(m) . $$

Since the purpose of model selection is to minimize the risk $\mathbb{E}[\ell(s, \hat s_m)]$, an ideal penalty
would be

$$ \mathrm{pen}(m) = \mathbb{E}[\hat v_m] + \mathbb{E}[\ell(s_m, \hat s_m)] . $$

In Gaussian least-squares regression with a fixed design, the models $S_m$ are linear and
$\mathbb{E}[\hat v_m] = \mathbb{E}[\ell(s_m, \hat s_m)]$ is explicitly computable if the noise level is constant and known;
this leads to Mallows' $C_p$ penalty. When $\gamma_n$ is the log-likelihood,

$$ \mathbb{E}[\hat v_m] \approx \mathbb{E}[\ell(s_m, \hat s_m)] \approx \frac{D_m}{2n} $$

asymptotically, where $D_m$ stands for the number of parameters defining model $S_m$; this
leads to Akaike's Information Criterion (AIC). Therefore, both Mallows' $C_p$ and Akaike's
criterion are based on the unbiased (or asymptotically unbiased) risk estimation principle.

This paper goes further in this direction, using that $\mathbb{E}[\hat v_m] \approx \mathbb{E}[\ell(s_m, \hat s_m)]$ remains
a valid approximation in a non-asymptotic framework. Then, a good penalty becomes
$2 \mathbb{E}[\hat v_m]$ or $2 \hat v_m$, having in mind concentration arguments. Since $\hat v_m$ is the minimal penalty,
this explains the slope heuristics (Birgé and Massart, 2007) and connects it to Mallows' $C_p$
and Akaike's heuristics.



The second main idea developed in this paper is that the minimal penalty can be estimated
from the data; Algorithm 1 uses the jump of complexity which occurs around the
minimal penalty, as shown in Sections 3.3 and 4.5.2. Another way to estimate the minimal
penalty when it is (at least approximately) of the form $\alpha D_m$ is to estimate $\alpha$ by the slope of
the graph of $\gamma_n(\hat s_m)$ for large enough values of $D_m$; this method can be extended to other
shapes of penalties, simply by replacing $D_m$ by some (known!) function $f(D_m)$.
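
This slope-based estimation can be sketched in Python as follows. It is an illustration under assumptions of ours, not a prescribed implementation: the empirical risk is fitted by ordinary least squares against the shape $f(D_m)$ over the models with `shape >= min_shape`, and since the empirical risk decreases with the dimension, the estimate of $\alpha$ is taken as minus the fitted slope. The names `slope_estimate` and `slope_selection` are hypothetical.

```python
import numpy as np

def slope_estimate(emp_risk, shape, min_shape):
    """Estimate alpha from the linear part of the graph of the empirical risk.

    Fits emp_risk ~ intercept + slope * shape over models with shape >= min_shape
    and returns -slope, since the empirical risk decreases like -alpha * shape.
    """
    emp_risk = np.asarray(emp_risk, dtype=float)
    shape = np.asarray(shape, dtype=float)
    large = shape >= min_shape
    if large.sum() < 2:
        raise ValueError("need at least two large models to fit a slope")
    slope, _intercept = np.polyfit(shape[large], emp_risk[large], deg=1)
    return -float(slope)

def slope_selection(emp_risk, shape, min_shape):
    """Select the model minimizing emp_risk + 2 * alpha_hat * shape."""
    alpha_hat = slope_estimate(emp_risk, shape, min_shape)
    crit = np.asarray(emp_risk, dtype=float) + 2.0 * alpha_hat * np.asarray(shape, dtype=float)
    return int(np.argmin(crit))
```

The cutoff `min_shape` plays the role of "large enough values of $D_m$": it should exclude the models whose bias is not yet negligible, and its choice is left to the user in this sketch.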

The slope heuristics can even be combined with resampling ideas, by taking a function
$f$ built from a randomized empirical criterion. As shown by Arlot (2008a), this approach
is much more efficient than the rougher choice $f(D_m) = D_m$ for heteroscedastic regression
frameworks. The question of the optimality of the slope heuristics in general remains an
open problem; nevertheless, we believe that this heuristics can be useful in practice, and
that proving its efficiency in this paper helps to understand it better.



Let us finally mention that contrary to Birgé and Massart (2007), we assume in this
paper that the collection of models $\mathcal{M}_n$ is "small", that is $\mathrm{Card}(\mathcal{M}_n)$ grows at most like a
power of $n$. For several problems, such as complete variable selection, larger collections
of models have to be considered; then, it is known from the homoscedastic case that the
minimal penalty is much larger than $\mathbb{E}[p_2(m)]$. Nevertheless, Emilie Lebarbier has used the
slope heuristics with $f(D_m) = D_m (2.5 + \ln(n / D_m))$ for multiple change-points detection
from $n$ noisy data, using the results by Birgé and Massart (2007) in the Gaussian case.

Let us now explain how we expect to generalize the slope heuristics to the non-Gaussian
heteroscedastic case when $\mathcal{M}_n$ is large. First, group the models according to some complexity
index $C_m$ such as their dimensions $D_m$; for $C \in \{1, \ldots, n^k\}$, define $\widetilde S_C = \bigcup_{C_m = C} S_m$.
Then, replace the model selection problem with the family $(S_m)_{m \in \mathcal{M}_n}$ by a "complexity
selection problem", that is model selection with the family $(\widetilde S_C)_{1 \le C \le n^k}$. We conjecture
that this grouping of the models is sufficient to take into account the richness of $\mathcal{M}_n$ for
the optimal calibration of the penalty. A theoretical justification of this point could rely on
the extension of our results to any kind of model, since $\widetilde S_C$ is not a vector space in general.
Appendix A. Proofs

This appendix is devoted to the proofs of the results stated in the paper. Proposition 1 is
proved in Section A.2, Theorem 3 is proved in Sections A.3 and A.4, Theorem 2 is proved
in Section A.5; the remaining sections are devoted to probabilistic results used in the main
proofs and technical proofs.



A.1 Conventions and notations

In the rest of the paper, $L$ denotes a universal constant, not necessarily the same at each
occurrence. When $L$ is not universal, but depends on $p_1, \ldots, p_k$, it is written $L_{p_1, \ldots, p_k}$.
Similarly, $L_{(\mathbf{SH2})}$ (resp. $L_{(\mathbf{SH5})}$) denotes a constant allowed to depend on the parameters
of the assumptions made in Theorem 2 (resp. Theorem 5), including (P1) and (P2). We
also make use of the following notations:

• $\forall a, b \in \mathbb{R}$, $a \wedge b$ is the minimum of $a$ and $b$, $a \vee b$ is the maximum of $a$ and $b$, $a_+ = a \vee 0$
is the positive part of $a$ and $a_- = a \wedge 0$ is its negative part.

• $\forall I_\lambda \subset \mathcal{X}$, $p_\lambda := P(X \in I_\lambda)$ and $\sigma_\lambda^2 := \mathbb{E}[(Y - s_m(X))^2 \mid X \in I_\lambda]$.

• Since $\mathbb{E}[p_1(m)]$ is not well-defined (because of the event $\{\min_{\lambda \in \Lambda_m} \{\hat p_\lambda\} = 0\}$), we
have to take the following convention

$$ p_1(m) = \tilde p_1(m) := \sum_{\lambda \in \Lambda_m \text{ s.t. } \hat p_\lambda > 0} p_\lambda (\beta_\lambda - \hat\beta_\lambda)^2 + \sum_{\lambda \in \Lambda_m \text{ s.t. } \hat p_\lambda = 0} p_\lambda \sigma_\lambda^2 . $$

Remark that $p_1(m) = \tilde p_1(m)$ when $\min_{\lambda \in \Lambda_m} \{\hat p_\lambda\} > 0$, so that this convention has
no consequences on the final results (Theorems 2 and 5).



A.2 Proof of Proposition 1

First, since $\mathcal{M}_n$ is finite, the infimum in (3) is attained as soon as $G(m_{i-1}) \neq \emptyset$, so that $m_i$
is well defined for every $i \le i_{\max}$. Moreover, by construction, $g(m_i)$ decreases with $i$, so that
all the $m_i \in \mathcal{M}_n$ are different; hence, Algorithm 2 terminates and $i_{\max} + 1 \le \mathrm{Card}(\mathcal{M}_n)$.
We now prove by induction the following property for every $i \in \{0, \ldots, i_{\max}\}$:

$\mathcal{P}_i$: $K_i < K_{i+1}$ and $\forall K \in [K_i, K_{i+1})$, $\hat m(K) = m_i$.

Notice also that $K_i$ can always be defined by @ with the convention $\inf \emptyset = +\infty$.



$\mathcal{P}_0$ HOLDS TRUE

By definition of $K_1$, it is clear that $K_1 > 0$ (it may be equal to $+\infty$ if $G(m_0) = \emptyset$). For
$K = K_0 = 0$, the definition of $m_0$ is the one of $\hat m(0)$, so that $\hat m(K) = m_0$. For $K \in (0, K_1)$,
Lemma 4 shows that either $\hat m(K) = \hat m(0) = m_0$ or $\hat m(K) \in G(m_0)$. In the latter case, by
definition of $K_1$,

$$ \frac{f(\hat m(K)) - f(m_0)}{g(m_0) - g(\hat m(K))} \ge K_1 > K , $$

hence

$$ f(\hat m(K)) + K g(\hat m(K)) > f(m_0) + K g(m_0) , $$

which is contradictory with the definition of $\hat m(K)$. Therefore, $\mathcal{P}_0$ holds true.

$\mathcal{P}_i \Rightarrow \mathcal{P}_{i+1}$ FOR EVERY $i \in \{0, \ldots, i_{\max} - 1\}$

Assume that $\mathcal{P}_i$ holds true. First, we have to prove that $K_{i+2} > K_{i+1}$. Since $K_{i_{\max}+1} = +\infty$,
this is clear if $i = i_{\max} - 1$. Otherwise, $K_{i+2} < +\infty$ and $m_{i+2}$ exists. Then, by definition of
$m_{i+2}$ and $K_{i+2}$ (resp. $m_{i+1}$ and $K_{i+1}$), we have

$$ f(m_{i+2}) - f(m_{i+1}) = K_{i+2} (g(m_{i+1}) - g(m_{i+2})) \qquad (13) $$
$$ f(m_{i+1}) - f(m_i) = K_{i+1} (g(m_i) - g(m_{i+1})) . \qquad (14) $$

Moreover, $m_{i+2} \in G(m_{i+1}) \subset G(m_i)$, and $m_{i+2} \prec m_{i+1}$ (because $g$ is non-decreasing).
Using again the definition of $K_{i+1}$, we have

$$ f(m_{i+2}) - f(m_i) > K_{i+1} (g(m_i) - g(m_{i+2})) \qquad (15) $$

(otherwise, we would have $m_{i+2} \in F_{i+1}$ and $m_{i+2} \prec m_{i+1}$, which is not possible). Combining
the difference of (15) and (14) with (13), we have

$$ K_{i+2} (g(m_{i+1}) - g(m_{i+2})) > K_{i+1} (g(m_{i+1}) - g(m_{i+2})) , $$

hence $K_{i+2} > K_{i+1}$, since $g(m_{i+1}) > g(m_{i+2})$.

Second, we prove that $\hat m(K_{i+1}) = m_{i+1}$. From $\mathcal{P}_i$, we know that for every $m \in \mathcal{M}_n$, for
every $K \in [K_i, K_{i+1})$, $f(m_i) + K g(m_i) \le f(m) + K g(m)$. Taking the limit when $K$ tends
to $K_{i+1}$, it follows that $m_i \in E(K_{i+1})$. By (14), we then have $m_{i+1} \in E(K_{i+1})$. On the
other hand, if $m \in E(K_{i+1})$, Lemma 4 shows that either $f(m) = f(m_i)$ and $g(m) = g(m_i)$
or $m \in G(m_i)$. In the first case, $m_{i+1} \prec m$ (because $g$ is non-decreasing). In the second
one, $m \in F_{i+1}$, so $m_{i+1} \preceq m$. Since $\hat m(K_{i+1})$ is the smallest element of $E(K_{i+1})$, we have
proved that $m_{i+1} = \hat m(K_{i+1})$.

Last, we have to prove that $\hat m(K) = m_{i+1}$ for every $K \in (K_{i+1}, K_{i+2})$. From the last
statement of Lemma 4, we have either $\hat m(K) = \hat m(K_{i+1})$ or $\hat m(K) \in G(\hat m(K_{i+1}))$. In the latter
case (which is only possible if $K_{i+2} < \infty$), by definition of $K_{i+2}$,

$$ \frac{f(\hat m(K)) - f(m_{i+1})}{g(m_{i+1}) - g(\hat m(K))} \ge K_{i+2} > K , $$

so that

$$ f(\hat m(K)) + K g(\hat m(K)) > f(m_{i+1}) + K g(m_{i+1}) , $$

which is contradictory with the definition of $\hat m(K)$. ■
Lemma 4 With the notations of Proposition 1 and its proof, if $0 \le K < K'$, $m \in E(K)$
and $m' \in E(K')$, then one of the two following statements holds true:

(a) $f(m) = f(m')$ and $g(m) = g(m')$.

(b) $f(m) < f(m')$ and $g(m) > g(m')$.

In particular, either $\hat m(K) = \hat m(K')$ or $\hat m(K') \in G(\hat m(K))$.

Proof By definition of $E(K)$ and $E(K')$,

$$ f(m) + K g(m) \le f(m') + K g(m') \qquad (16) $$
$$ f(m') + K' g(m') \le f(m) + K' g(m) . \qquad (17) $$

Summing (16) and (17) gives $(K' - K) g(m') \le (K' - K) g(m)$ so that

$$ g(m') \le g(m) . \qquad (18) $$

Since $K \ge 0$, (16) and (18) give $f(m) + K g(m) \le f(m') + K g(m)$, that is

$$ f(m) \le f(m') . \qquad (19) $$

Moreover, if $g(m) = g(m')$, (17) gives $f(m') \le f(m)$, that is $f(m) = f(m')$
by (19). Similarly, (16) and (18) show that $f(m) = f(m')$ implies $g(m) = g(m')$. In both
cases, (a) is satisfied. Otherwise, $f(m) < f(m')$ and $g(m) > g(m')$, that is the (b) statement.

The last statement follows by taking $m = \hat m(K)$ and $m' = \hat m(K')$, because $g$ is non-decreasing,
so that the minimum of $g$ in $E(K)$ is attained by $\hat m(K)$. ■
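
The structure established above (a finite increasing sequence of constants $K_i$, with $\hat m(K)$ constant on each interval $[K_i, K_{i+1})$) suggests a simple way to compute the whole path $K \mapsto \hat m(K)$. The Python sketch below is written in the spirit of this construction, with simplifications of ours: models are indexed by integers, `f[j]` and `g[j]` stand for the empirical risk and the penalty shape of model `j`, and ties are broken by the smallest value of `g`, then by the smallest index, as a stand-in for the order $\prec$. The names `selection_path` and `model_at` are hypothetical.

```python
def selection_path(f, g):
    """Return [(K_0, m_0), (K_1, m_1), ...] with K_0 = 0.

    Model m_i minimizes f + K * g for every K in [K_i, K_{i+1}).
    """
    n = len(f)
    # m_0 minimizes f alone; ties are broken by the smallest g, then the smallest index.
    m = min(range(n), key=lambda j: (f[j], g[j], j))
    path = [(0.0, m)]
    while True:
        # Candidates: strictly larger f and strictly smaller g than the current model.
        cand = [j for j in range(n) if f[j] > f[m] and g[j] < g[m]]
        if not cand:
            return path
        # Constant at which each candidate ties with the current model.
        ratio = {j: (f[j] - f[m]) / (g[m] - g[j]) for j in cand}
        k_next = min(ratio.values())
        m = min((j for j in cand if ratio[j] == k_next), key=lambda j: (g[j], j))
        path.append((k_next, m))

def model_at(path, k):
    """Model selected for the constant k >= 0, read off the path."""
    chosen = path[0][1]
    for k_i, m_i in path:
        if k_i <= k:
            chosen = m_i
    return chosen
```

Each step scans all models once, so the whole path costs at most a number of operations quadratic in the number of models, without any grid over $K$.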



A.3 A general oracle inequality

First of all, let us state a general theorem, from which Theorem 3 is an obvious corollary.

Theorem 5 Suppose all the assumptions of Section 4.2 are satisfied together with

(Ap) The bias decreases like a power of $D_m$: there exist $\beta_- \ge \beta_+ > 0$ and $C_+, C_- > 0$ such
that

$$ C_- D_m^{-\beta_-} \le \ell(s, s_m) \le C_+ D_m^{-\beta_+} . $$

Let $L, \xi, c_1, C_1, C_2 \ge 0$, $c_2 > 1$ and assume that an event of probability at least $1 - L n^{-2}$
exists on which, for every $m \in \mathcal{M}_n$ such that $D_m \ge \ln(n)^\xi$,

$$ \mathbb{E}[c_1 P(\gamma(\hat s_m) - \gamma(s_m)) + c_2 P_n(\gamma(s_m) - \gamma(\hat s_m))] \le \mathrm{pen}(m) \le \mathbb{E}[C_1 P(\gamma(\hat s_m) - \gamma(s_m)) + C_2 P_n(\gamma(s_m) - \gamma(\hat s_m))] . \qquad (20) $$

Then, for every $0 < \eta < \min\{\beta_+; 1\}/2$, there exist a constant $K_3$ and a sequence $\varepsilon_n$
tending to zero at infinity such that, with probability at least $1 - K_3 n^{-2}$,

$$ D_{\hat m} \le n^{1 - \eta} \quad \text{and} \quad \ell(s, \hat s_{\hat m}) \le \left( \frac{1 + (C_1 + C_2 - 2)_+}{(c_1 + c_2 - 1) \wedge 1} + \varepsilon_n \right) \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} \qquad (21) $$

where $\hat m$ is defined by (6). Moreover, we have the oracle inequality

$$ \mathbb{E}[\ell(s, \hat s_{\hat m})] \le \left( \frac{1 + (C_1 + C_2 - 2)_+}{(c_1 + c_2 - 1) \wedge 1} + \varepsilon_n \right) \mathbb{E}\left[ \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} \right] + \frac{A^2 K_3}{n^2} . \qquad (22) $$

The constant $K_3$ may depend on $L$, $\eta$, $\xi$, $c_1$, $c_2$, $C_1$, $C_2$ and constants in (P1), (P2),
(Ab), (An), (Ap) and ($\mathbf{Ar}_\ell^X$), but not on $n$. The term $\varepsilon_n$ is smaller than $\ln(n)^{-1/5}$; it can
be made smaller than $n^{-\delta}$ for any $\delta \in (0; \delta_0(\beta_-, \beta_+))$ at the price of enlarging $K_3$.



The particular form of condition (20) on the penalty is motivated by the fact that the
ideal shape of penalty $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$ (or equivalently $\mathbb{E}[2 p_2(m)]$) is unknown in general.
Then, it has to be estimated from the data, for instance by resampling. Under the assumptions
of Theorem 5, Arlot (2008b,a) has proved that resampling and V-fold penalties
satisfy condition (20) with constants $c_1 + c_2 = 2 - \delta_n$, $C_1 + C_2 = 2 + \delta_n$ (for some absolute
sequence $\delta_n$ tending to zero at infinity), and some numerical constant $\xi > 0$. Then, Theorem 5
shows that such a penalization procedure satisfies an oracle inequality with leading
constant tending to 1 asymptotically.

The rationale behind Theorem 5 is that if $\mathrm{pen}(m)$ is close to $c_1 p_1(m) + c_2 p_2(m)$, then
$\mathrm{crit}(m) \approx \ell(s, s_m) + c_1 p_1(m) + (c_2 - 1) p_2(m)$. When $c_1 = c_2 = 1$, this is exactly the ideal
criterion $\ell(s, \hat s_m)$. When $c_1 + c_2 = 2$ with $c_1 \ge 0$ and $c_2 > 1$, we obtain the same result
because $p_1(m)$ and $p_2(m)$ are quite close, at least when $D_m$ is large enough. The closeness
between $p_1$ and $p_2$ is the keystone of the slope heuristics. Notice that if $\max_{m \in \mathcal{M}_n} D_m \le
K'_3 (\ln(n))^{-1} n$ (for some constant $K'_3$ depending only on the assumptions of Theorem 3, as
$K_3$), one can replace the condition $c_2 > 1$ by $c_1 + c_2 > 1$ and $c_1, c_2 \ge 0$.

A.4 Proof of Theorem 5

This proof is similar to the one of Arlot (2008a, Theorem 1). We give it for the sake of
completeness.

From ([3]), we have for each m E A4 n such that A n {m) := min / \ g A m {np\} > 



Hs,Sjn) - (pen( d (m) - pen(m)) < £(s,s m ) + (pen(m) - pen- d (m)) 



(23) 



with pen- d (m) := pi(m) + p 2 {m) — 5(m) = pen(m) + (P — P n )'j(s) and 5(m) := (P n — 
P){l (s m ) — 7 ( s ))- It is sufficient to control pen — penj d for every m £ M n . 

We will thus use the concentration inequalities of Section IA.6I with x = 7 ln(n) and 
7 = 2 + aju- Define B n (m) = min,\6A m {np\}, and £l n the event on which 

• for every m £ A4 n , (|20p holds 

• for every m G M n such that B n {m) > 1, (gSJ) and d3D|) hold: 



pi{m) > E [pi(m)} - L( SH 5) 
p~i(m) < E [pi{m) \ + L ( sh5) 



LB„ {in) 




E[ P2 {m)} 



E[ P2 (m)] 
for every m G M n such that B n (m) > 0, (J^TJ), (J28|) and [26] hold: 
77 1 ^(SH5) ln(n) 2 \ 



Pi(m) > 



2+( 7 + l) J B n (m)- 1 ln(n) 



e[kW] 



[pa (m) - E [p 2 (m) ] I < L(SH ^ (n) [£ ( a, s m ) + E [p 2 (m) ] ] 



|<5(m)| < 



S, Sr 



VDr, 
+ L 



rjy- ■ — (sh5) rjY 

From Proposition [11] (for pi), Proposition [10] (for P2) and Proposition [8] (for 5(rn)), 



ln(n) 



E[p 2 (m)] 



'(tt n )>l-L n- 2 ^ > 1 - L CA1 



m£M r 



For every m G M. n such that L> m < L c xn\n.(n) 1 , (Ar^) implies that B n (m) > 
L _1 ln(n) > 1. As a consequence, on O n , if ln(n) 7 < D m < L x reln(n) : 

L (SH5)E [£(s,s m ) +P2M] 



max 



{ |pi(m) - E [pi(m)]| , |p 2 (m) - E [p 2 (m)]| , |5(m)| } 



< 



ln(n) 



Using (|32|) (in Proposition fl~2l) and the fact that B n (m) > L 1 ln(n), 



(ci + c 2 ) 1 - 8, 



< E [pen(m)] < 



(d + C 2 ) 1 + S n 



■E [pi(m) + p 2 (m)] 



with < <5 n < Lln(n) 1//4 . We deduce: if n > £(sh5)> f° r every m G A4 n such that 
ln(n) 7 < D m < L c x nln(n) -1 , on fi n , 



{ c\ + c 2 - 2 ; 



-k(SH5) 

ln^n) 1 /* 



Pi(m) < (pen — pen id )(m) 



< 



(C! + C 2 -2) + + 



(SH5) 



pi{m) 



ln(n)V4 

We need to assume that n is large enough in order to upper bound E [p 2 (m)] in terms of 
pi(m), since we only have 



Pi(m) > 



1 



J (SH5) 



ln(n)V4 

in general. Combined with (|23p . this gives: if n > £(shs) 



E[p 2 (m)] 



£(S, S,^) lln(n) 5 <D ffi <L x nln(n)- 



1 < 



1 + (Ci + C 2 - 2)+ L (SH5 ) 



inf 



(ci + P2-l)Al ln(n)V4 
{^(s,? m )} . 



mG-M™ S.t. ln(n) 7 <D m <L x nln(n)- 1 

We now use Lemmas [6] and [7] below to control on Q, n the dimensions of the selected 
model fh and the oracle model m* G argmin m6 _A^ n {£(s,s m )}. 

The result follows since £(sh5) ln(n)^ 1 / 4 < e n = ln(n)~ 1//5 for n > L(shs)- We finally 
remove the condition n > no = £(SH5) by choosing .K3 = £(shs) such that K^Uq 2 > 1. 
Classical oracle inequality. Since (21) holds true on $\Omega_n$,

E[£(s,s ih )}=E[£(s,s fh )la n }+E[£(s,Sff l )tno\ 



< [2f7-l + e n ]E 

which proves ([22j) . ■ 



inf {£(s,s m )} 

meMn 



+ A 2 K 3 F(Q c n ] 



Lemma 6 (Control on the dimension of the selected model) Let $c > 0$ and $\alpha > (1 - \beta_+)_+ / 2$. Then, if $n \geq L_{(\mathbf{SH5}),c,\alpha}$, on the event $\Omega_n$ defined in the proof of Theorem 5,

\[ \ln(n)^7 \leq D_{\widehat{m}} \leq n^{1/2+\alpha} \leq c \, n \ln(n)^{-1} . \]

Lemma 7 (Control on the dimension of the oracle model) Define the oracle model
$m^{\star} \in \arg\min_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \}$. Let $c > 0$ and $\alpha > (1 - \beta_+)_+ / 2$. Then, if $n \geq L_{(\mathbf{SH5}),c,\alpha}$,
on the event $\Omega_n$ defined in the proof of Theorem 5,

\[ \ln(n)^7 \leq D_{m^{\star}} \leq n^{1/2+\alpha} \leq c \, n \ln(n)^{-1} . \]

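Proof of Lemma 6. By definition, $\widehat{m}$ minimizes $\operatorname{crit}(m)$ over $\mathcal{M}_n$. It thus also minimizes

\[ \operatorname{crit}'(m) = \operatorname{crit}(m) - P_n \gamma(s) = \ell(s, s_m) - p_2(m) + \overline{\delta}(m) + \operatorname{pen}(m) \]

over $\mathcal{M}_n$.

1. Lower bound on $\operatorname{crit}'(m)$ for small models: let $m \in \mathcal{M}_n$ such that $D_m < (\ln(n))^7$.
We then have

\[ \ell(s, s_m) \geq C_- (\ln(n))^{-7\beta_-} \quad \text{from (Ap)} \]
\[ \operatorname{pen}(m) \geq 0 \]
\[ p_2(m) \leq L_{(\mathbf{SH5})} \sqrt{\frac{\ln(n)}{n}} + L_{(\mathbf{SH5})} \frac{D_m}{n} \leq L_{(\mathbf{SH5})} \sqrt{\frac{\ln(n)}{n}} \quad \text{from (27)} \]

and from (26) (in Proposition 8),

\[ \overline{\delta}(m) \geq - L_A \left( \sqrt{\frac{\ell(s, s_m) \ln(n)}{n}} + \frac{\ln(n)}{n} \right) \geq - L_A \sqrt{\frac{\ln(n)}{n}} . \]

We then have

\[ \operatorname{crit}'(m) \geq L_{(\mathbf{SH5})} (\ln(n))^{-7\beta_-} . \]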

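2. Lower bound for large models: let $m \in \mathcal{M}_n$ such that $D_m > n^{1/2+\alpha}$. From (20) and
(27) (in Proposition 10),

\[ \operatorname{pen}(m) - p_2(m) \geq (c_2 - 1) \, \mathbb{E}[p_2(m)] - L_A \sqrt{\frac{\ln(n)}{n}} \geq \frac{(c_2 - 1) \, \sigma_{\min}^2 D_m}{n} - L_A \sqrt{\frac{\ln(n)}{n}} \]

and from (24),

\[ \overline{\delta}(m) \geq - L_{(\mathbf{SH5})} \sqrt{\frac{\ln(n)}{n}} . \]

Hence, if $D_m \geq n^{1/2+\alpha}$ and $n \geq L_{(\mathbf{SH5}),\alpha}$,

\[ \operatorname{crit}'(m) \geq \operatorname{pen}(m) + \overline{\delta}(m) - p_2(m) \geq L_{(\mathbf{SH5}),\alpha} \, n^{-1/2+\alpha} . \]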

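3. There exists a better model for $\operatorname{crit}(m)$: from (P2), there exists $m_0 \in \mathcal{M}_n$ such that
$\sqrt{n} \leq D_{m_0} \leq c_{\mathrm{rich}} \sqrt{n}$. If moreover $n \geq L_{c_{\mathrm{rich}},\alpha}$, then

\[ \ln(n)^7 \leq \sqrt{n} \leq D_{m_0} \leq c_{\mathrm{rich}} \sqrt{n} \leq n^{1/2+\alpha} . \]

By (33) in Lemma 13, $A_n(m_0) \geq 1$ with probability at least $1 - L n^{-2}$.
Using (Ap),

\[ \ell(s, s_{m_0}) \leq C_+ n^{-\beta_+/2} \]

so that, when $n \geq L_{(\mathbf{SH5})}$,

\[ \operatorname{crit}'(m_0) \leq \ell(s, s_{m_0}) + \left| \overline{\delta}(m_0) \right| + \operatorname{pen}(m_0) \leq L_{(\mathbf{SH5})} \left( n^{-\beta_+/2} + n^{-1/2} \right) . \]

If $n \geq L_{(\mathbf{SH5}),\alpha}$, this upper bound is smaller than the previous lower bounds for small
and large models. ■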

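Proof of Lemma 7. Recall that $m^{\star}$ minimizes $\ell(s, \widehat{s}_m) = \ell(s, s_m) + p_1(m)$ over $m \in \mathcal{M}_n$,
with the convention $\ell(s, \widehat{s}_m) = \infty$ if $A_n(m) = 0$.

1. Lower bound on $\ell(s, \widehat{s}_m)$ for small models: let $m \in \mathcal{M}_n$ such that $D_m < (\ln(n))^7$.
From (Ap), we have

\[ \ell(s, \widehat{s}_m) \geq \ell(s, s_m) \geq C_- (\ln(n))^{-7\beta_-} . \]

2. Lower bound on $\ell(s, \widehat{s}_m)$ for large models: let $m \in \mathcal{M}_n$ such that $D_m > n^{1/2+\alpha}$.
From (31), for $n \geq L_{(\mathbf{SH5}),\alpha}$,

\[ p_1(m) \geq \left( \frac{1}{2 + (\gamma+1) B_n(m)^{-1} \ln(n)} - \frac{L_{(\mathbf{SH5}),\alpha}}{\ln(n)} \right) \mathbb{E}[p_2(m)] \]

so that $\ell(s, \widehat{s}_m) \geq p_1(m) \geq L_{(\mathbf{SH5}),\alpha} \, n^{-1/2+\alpha}$.

3. There exists a better model for $\ell(s, \widehat{s}_m)$: let $m_0 \in \mathcal{M}_n$ be as in the proof of Lemma 6
and assume that $n \geq L_{c_{\mathrm{rich}},\alpha}$. Then,

\[ p_1(m_0) \leq L_{(\mathbf{SH5})} \, \mathbb{E}[p_2(m_0)] \leq L_{(\mathbf{SH5})} \, n^{-1/2} \]

and the arguments of the previous proof show that

\[ \ell(s, \widehat{s}_{m_0}) \leq L_{(\mathbf{SH5})} \left( n^{-\beta_+/2} + n^{-1/2} \right) \]

which is smaller than the previous lower bounds for $n \geq L_{(\mathbf{SH5})}$. ■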
A.5 Proof of Theorem 2

Similarly to the proof of Theorem 5, we consider the event $\Omega'_n$, of probability at least
$1 - L c_{\mathcal{M}} n^{-2}$, on which:

• for every $m \in \mathcal{M}_n$, (7) (for pen), (31) (for $p_1$), (27)-(28) (for $p_2$, with $x = \gamma \ln(n)$ and
$\theta = \sqrt{\ln(n)/n}$) and (24)-(26) (for $\overline{\delta}$, with $x = \gamma \ln(n)$ and $\eta = \sqrt{\ln(n)/n}$) hold true.

• for every $m \in \mathcal{M}_n$ such that $B_n(m) \geq 1$, (29) and (30) hold (for $p_1$).

Lower bound on $D_{\widehat{m}}$. By definition, $\widehat{m}$ minimizes

\[ \operatorname{crit}'(m) = \operatorname{crit}(m) - P_n \gamma(s) = \ell(s, s_m) - p_2(m) + \overline{\delta}(m) + \operatorname{pen}(m) \]

over $m \in \mathcal{M}_n$ such that $A_n(m) \geq 1$. As in the proof of Theorem 5, we define $c = L_{c^X_{\mathrm{r},\ell}} > 0$
such that for every model of dimension $D_m \leq c \, n \ln(n)^{-1}$, $B_n(m) \geq L^{-1} \ln(n) \geq 1$. Let
$c' = \min(c, c_0)$ and $d \in (0, 1)$ a constant to be chosen later.

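1. Lower bound on $\operatorname{crit}'(m)$ for "small" models: assume that $m \in \mathcal{M}_n$ and $D_m \leq d c' n \ln(n)^{-1}$. Then, $\ell(s, s_m) + \operatorname{pen}(m) \geq 0$ and from (24),

\[ \overline{\delta}(m) \geq - L_A \sqrt{\frac{\ln(n)}{n}} . \]

If $D_m \geq \ln(n)^4$, (28) implies that

\[ p_2(m) \leq \left( 1 + \frac{L_{(\mathbf{SH2})}}{\ln(n)} \right) \mathbb{E}[p_2(m)] \leq \frac{L_{(\mathbf{SH2})} D_m}{n} \leq \frac{c' d \, L_{(\mathbf{SH2})}}{\ln(n)} . \]

On the other hand, if $D_m < \ln(n)^4$, (27) implies that

\[ p_2(m) \leq L_{(\mathbf{SH2})} \sqrt{\frac{\ln(n)}{n}} . \]

We then have

\[ \operatorname{crit}'(m) \geq - d \, L_{(\mathbf{SH2})} (\ln(n))^{-1} . \]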
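2. There exists a better model for $\operatorname{crit}(m)$: let $m_1 \in \mathcal{M}_n$ such that

\[ \ln(n)^4 \leq \frac{c' d \, n}{c_{\mathrm{rich}} \ln(n)} \leq D_{m_1} \leq \frac{d \, n}{\ln(n)} \leq n . \]

From (P2+), this is possible as soon as $n \geq L_{c_{\mathrm{rich}},c',d}$. By (33) in Lemma 13, $A_n(m_1) \geq 1$
with probability at least $1 - L n^{-2}$.
We then have

\[ \ell(s, s_{m_1}) \leq L_{(\mathbf{SH2}),c'} \ln(n)^{\beta_+} n^{-\beta_+} \quad \text{by (Ap)} \]
\[ p_2(m_1) \geq \left( 1 - \frac{L_{(\mathbf{SH2})}}{\ln(n)} \right) \mathbb{E}[p_2(m_1)] \quad \text{by (28)} \]
\[ \operatorname{pen}(m_1) \leq K \, \mathbb{E}[p_2(m_1)] \quad \text{by (7)} \]
\[ \left| \overline{\delta}(m_1) \right| \leq L_A \sqrt{\frac{\ln(n)}{n}} \quad \text{by (24)} \]

so that

\[ \operatorname{crit}'(m_1) \leq L_{(\mathbf{SH2}),c'} \ln(n)^{\beta_+} n^{-\beta_+} + \left( K - 1 + \frac{L_{(\mathbf{SH2})}}{\ln(n)} \right) \mathbb{E}[p_2(m_1)] + L_A \sqrt{\frac{\ln(n)}{n}} \leq \left( K - 1 + L_{(\mathbf{SH2})} (\ln(n))^{-1} \right) \frac{\sigma_{\min}^2 c'}{2 \ln(n)} \]

if $n \geq L_{(\mathbf{SH2}),c'}$.

We now choose $d$ such that the constant $d L_{(\mathbf{SH2})}$ appearing in the lower bound on
$\operatorname{crit}'(m)$ for "small" models is smaller than $(1 - K - L_{(\mathbf{SH2})} (\ln(n))^{-1}) \sigma_{\min}^2 c' / 2$, that
is $d \leq L_{(\mathbf{SH2})}$. Then, we assume that $n \geq n_0 = L_{(\mathbf{SH2}),c',d} = L_{(\mathbf{SH2})}$. Finally, we
remove this condition as before by enlarging $K_1$.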

Risk of $\widehat{m}$. The proof of (8) is quite similar to the one of Lemma 7. First, for every
model $m \in \mathcal{M}_n$ such that $A_n(m) \geq 1$ and $D_m \geq K_2 n \ln(n)^{-1}$, we have

\[ \ell(s, \widehat{s}_m) \geq p_1(m) \geq L_{(\mathbf{SH2})} K_2 \ln(n)^{-2} \quad \text{by (31)} . \]

Then, the model $m_0 \in \mathcal{M}_n$ defined previously satisfies $A_n(m_0) \geq 1$, and

\[ \ell(s, \widehat{s}_{m_0}) \leq L_{(\mathbf{SH2})} \left( n^{-\beta_+/2} + n^{-1/2} \right) . \]

If $n \geq L_{(\mathbf{SH2})}$, the ratio between these two bounds is larger than $\ln(n)$, so that (8) holds. ■
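
A.6 Concentration inequalities used in the main proofs

In this section, we no longer assume that each model is the set of piecewise constant functions
on some partition of $\mathcal{X}$. First, we control $\overline{\delta}(m)$ with general models and bounded data.

Proposition 8 Assume that $\|Y\|_\infty \leq A < \infty$. Then for all $x \geq 0$, on an event of probability
at least $1 - 2e^{-x}$:

\[ \forall \eta > 0, \quad \left| \overline{\delta}(m) \right| \leq \eta \, \ell(s, s_m) + \left( \frac{4}{\eta} + \frac{8}{3} \right) \frac{A^2 x}{n} . \tag{24} \]

If moreover

\[ Q_m^{(p)} := \frac{n \, \mathbb{E}[p_2(m)]}{D_m} > 0 , \tag{25} \]

on the same event,

\[ \left| \overline{\delta}(m) \right| \leq \frac{\ell(s, s_m)}{\sqrt{D_m}} + \frac{20}{3} \, \frac{A^2}{Q_m^{(p)}} \, \frac{\mathbb{E}[p_2(m)]}{\sqrt{D_m}} \, x . \tag{26} \]

Remark 9 (Regressogram case) If $S_m$ is the set of piecewise constant functions on
some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$,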



D " 1 A -A 
Then, we derive a concentration inequality for $p_2(m)$ in the regressogram case from a
general result by Boucheron and Massart (2008).

Proposition 10 Let $S_m$ be the model of piecewise constant functions associated with the
partition $(I_\lambda)_{\lambda \in \Lambda_m}$. Assume that $\|Y\|_\infty \leq A$ and define $p_2(m) = P_n(\gamma(s_m) - \gamma(\widehat{s}_m))$.

Then, for every $x \geq 0$, there exists an event of probability at least $1 - e^{1-x}$ on which for
every $\theta \in (0; 1)$,

\[ \left| p_2(m) - \mathbb{E}[p_2(m)] \right| \leq L \left[ \theta \, \ell(s, s_m) + \frac{A^2 \sqrt{D_m x}}{n} + \frac{A^2 x}{\theta n} \right] \tag{27} \]

for some absolute constant $L$. If moreover $\sigma(X) \geq \sigma_{\min} > 0$ a.s., we have on the same
event:

\[ \left| p_2(m) - \mathbb{E}[p_2(m)] \right| \leq \frac{L}{\sqrt{D_m}} \left[ \ell(s, s_m) + \frac{A^2 \, \mathbb{E}[p_2(m)]}{\sigma_{\min}^2} \left( \sqrt{x} + x \right) \right] . \tag{28} \]

Finally, we recall a concentration inequality for $p_1(m)$ proved by Arlot (2008b, Proposition 9).
Its proof is particular to the regressogram case.

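Proposition 11 (Proposition 9, Arlot (2008b)) Let $\gamma > 0$ and $S_m$ be the model of
piecewise constant functions associated with the partition $(I_\lambda)_{\lambda \in \Lambda_m}$. Assume that $\|Y\|_\infty \leq
A < \infty$, $\sigma(X) \geq \sigma_{\min} > 0$ a.s. and $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \} \geq B_n > 0$. Then, if $B_n \geq 1$, on an
event of probability at least $1 - L n^{-\gamma}$,

\[ p_1(m) \geq \mathbb{E}[p_1(m)] - L_{A,\sigma_{\min},\gamma} \left[ \frac{\ln(n)^2}{\sqrt{D_m}} + e^{-L B_n} \right] \mathbb{E}[p_2(m)] \tag{29} \]

\[ p_1(m) \leq \mathbb{E}[p_1(m)] + L_{A,\sigma_{\min},\gamma} \left[ \frac{\ln(n)^2}{\sqrt{D_m}} + \sqrt{D_m} \, e^{-L B_n} \right] \mathbb{E}[p_2(m)] . \tag{30} \]

If we only have a lower bound $B_n > 0$, then, with probability at least $1 - L n^{-\gamma}$,

\[ p_1(m) \geq \left( \frac{1}{2 + (\gamma+1) B_n^{-1} \ln(n)} - \frac{L_{A,\sigma_{\min},\gamma} \ln(n)^2}{\sqrt{D_m}} \right) \mathbb{E}[p_2(m)] . \tag{31} \]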



A.7 Additional results needed

A crucial result in the proofs of Theorems 5 and 2 is that $\widetilde{p}_1(m)$ and $p_2(m)$ are close in
expectation; the following proposition was proved by Arlot (2008b, Lemma 7).

Proposition 12 (Lemma 7, Arlot (2008b)) Let $S_m$ be a model of piecewise constant
functions adapted to some partition $(I_\lambda)_{\lambda \in \Lambda_m}$. Assume that $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \} \geq B > 0$.
Then,

\[ \left( 1 - e^{-B} \right)^2 \mathbb{E}[p_2(m)] \leq \mathbb{E}[\widetilde{p}_1(m)] \leq \left[ 2 \wedge \left( 1 + 5.1 \times B^{-1/4} \right) + (B \vee 1) \, e^{-(B \vee 1)} \right] \mathbb{E}[p_2(m)] . \tag{32} \]



Finally, we need the following technical lemma in the proof of the main theorems. 
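Lemma 13 Let $(p_\lambda)_{\lambda \in \Lambda_m}$ be non-negative real numbers of sum 1, $(n \widehat{p}_\lambda)_{\lambda \in \Lambda_m}$ a multinomial
vector of parameters $(n; (p_\lambda)_{\lambda \in \Lambda_m})$. Then, for all $\gamma > 0$,

\[ \min_{\lambda \in \Lambda_m} \{ n \widehat{p}_\lambda \} \geq \frac{\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}}{2} - 2 (\gamma + 1) \ln(n) \tag{33} \]

with probability at least $1 - 2 n^{-\gamma}$.

Proof. By Bernstein inequality (Massart, 2007, Proposition 2.9), for all $\lambda \in \Lambda_m$,

\[ \mathbb{P}\left( n \widehat{p}_\lambda \geq n p_\lambda - \sqrt{2 n p_\lambda x} - \frac{x}{3} \right) \geq 1 - e^{-x} . \]

Take $x = (\gamma + 1) \ln(n)$ above, and remark that $\sqrt{2 n p_\lambda x} \leq \frac{n p_\lambda}{2} + x$. The union bound gives
the result since $\operatorname{Card}(\Lambda_m) \leq n$. ■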
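
As a numerical illustration of Lemma 13, the following sketch draws multinomial vectors and counts how often inequality (33) fails. It is a minimal simulation and not part of the proof: the sample size, the uniform cell probabilities, the value of $\gamma$ and the function names are all assumptions chosen for illustration.

```python
import numpy as np


def lemma13_holds(counts, probs, n, gamma):
    """Check (33): min_l n*p_hat_l >= min_l n*p_l / 2 - 2*(gamma+1)*ln(n)."""
    lhs = counts.min()
    rhs = (n * probs).min() / 2.0 - 2.0 * (gamma + 1.0) * np.log(n)
    return lhs >= rhs


def empirical_failure_rate(n, probs, gamma, n_rep=2000, seed=0):
    """Fraction of multinomial draws for which (33) fails."""
    rng = np.random.default_rng(seed)
    # Each row is one realization of the counts (n * p_hat_l) over the cells.
    draws = rng.multinomial(n, probs, size=n_rep)
    failures = sum(not lemma13_holds(c, probs, n, gamma) for c in draws)
    return failures / n_rep


# Hypothetical setting: 20 equiprobable cells, n = 2000 observations.
n, gamma = 2000, 1.0
probs = np.full(20, 1 / 20)
rate = empirical_failure_rate(n, probs, gamma)
print("empirical failure rate:", rate)
print("bound from Lemma 13:   ", 2 * n ** (-gamma))
```

In such a setting the observed failure rate should stay below the bound $2n^{-\gamma}$, which is far from tight here since the lemma has to hold for arbitrary cell probabilities.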



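A.8 Proof of Proposition 8

Since $\|Y\|_\infty \leq A$, we have $\|s\|_\infty \leq A$ and $\|s_m\|_\infty \leq A$. In fact, everything happens as if
$S_m \cup \{s\}$ was bounded by $A$ in $L^\infty$.
We have

\[ \overline{\delta}(m) = \frac{1}{n} \sum_{i=1}^n \Big( \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) - \mathbb{E}\left[ \gamma(s_m, (X, Y)) - \gamma(s, (X, Y)) \right] \Big) \]

and assumptions of Bernstein inequality (Massart, 2007, Proposition 2.9) are fulfilled with

\[ c = \frac{8 A^2}{3 n} \quad \text{and} \quad v = \frac{8 A^2 \ell(s, s_m)}{n} \]

since

\[ \left\| \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) - \mathbb{E}\left[ \gamma(s_m, (X, Y)) - \gamma(s, (X, Y)) \right] \right\|_\infty \leq 8 A^2 \]

and

\[ \operatorname{var}\big( \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) \big) \leq \mathbb{E}\left[ \big( \gamma(s_m, (X, Y)) - \gamma(s, (X, Y)) \big)^2 \right] \leq 8 A^2 \ell(s, s_m) \]

because $\|s_m - s\|_\infty \leq 2A$ and

\[ \big( \gamma(t, \cdot) - \gamma(s, \cdot) \big)^2 = \big( t(X) - s(X) \big)^2 \big( 2 (Y - s(X)) - t(X) + s(X) \big)^2 \]

and $\mathbb{E}\left[ (Y - s(X))^2 \mid X \right] \leq \frac{(2A)^2}{4} = A^2$.

We obtain that, with probability at least $1 - 2e^{-x}$,

\[ \left| \overline{\delta}(m) \right| \leq \sqrt{2 v x} + c x = \sqrt{\frac{16 A^2 \ell(s, s_m) x}{n}} + \frac{8 A^2 x}{3 n} \]

and (24) follows since $2\sqrt{ab} \leq a \eta + b \eta^{-1}$ for all $\eta > 0$. Taking $\eta = D_m^{-1/2} \leq 1$ and using
$Q_m^{(p)}$ defined by (25), we deduce (26). ■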
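
The step from Bernstein's inequality to (24) can be checked numerically. The sketch below assumes the values of $c$ and $v$ given in the proof, and the explicit form $\eta \, \ell(s, s_m) + (4/\eta + 8/3) A^2 x / n$ obtained by applying $2\sqrt{ab} \leq a\eta + b\eta^{-1}$ with $a = \ell(s, s_m)$ and $b = 4A^2x/n$. The numerical inputs and the function names are hypothetical.

```python
import math


def bernstein_bound(ell, A, x, n):
    """sqrt(2*v*x) + c*x with v = 8*A^2*ell/n and c = 8*A^2/(3*n)."""
    v = 8 * A**2 * ell / n
    c = 8 * A**2 / (3 * n)
    return math.sqrt(2 * v * x) + c * x


def bound_24(ell, A, x, n, eta):
    """eta*ell + (4/eta + 8/3) * A^2 * x / n, valid for every eta > 0."""
    return eta * ell + (4 / eta + 8 / 3) * A**2 * x / n


# Hypothetical values of the bias ell(s, s_m), the bound A, the level x and n.
ell, A, x, n = 0.3, 1.0, 5.0, 1000
base = bernstein_bound(ell, A, x, n)

# The relaxed bound dominates the Bernstein bound for every eta > 0 ...
etas = [0.01, 0.1, 1.0, 10.0]
gaps = [bound_24(ell, A, x, n, eta) - base for eta in etas]

# ... and coincides with it for the eta that balances the two terms.
eta_star = math.sqrt(4 * A**2 * x / (n * ell))
tight_gap = bound_24(ell, A, x, n, eta_star) - base

print(gaps, tight_gap)
```

Leaving $\eta$ free costs nothing at the optimum, and lets the same event serve both for $\eta = \sqrt{\ln(n)/n}$ and for $\eta = D_m^{-1/2}$.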
A.9 Proof of Proposition 10

We apply here a result by Boucheron and Massart (2008, Theorem 2.2 in a preliminary
version), in which it is only assumed that $\gamma$ takes its values in $[0; 1]$. This is satisfied when
$\|Y\|_\infty \leq A = 1/2$. When $A \neq 1/2$, we apply this result to $(2A)^{-1} Y$ and recover the general
result by homogeneity.

First, we recall this result in the bounded least-squares regression framework. For every
$t : \mathcal{X} \to \mathbb{R}$ and $\varepsilon > 0$, we define

\[ d^2(s, t) = 2 \ell(s, t) \quad \text{and} \quad w(\varepsilon) = \sqrt{2} \, \varepsilon . \]

Let $\varphi_m$ belong to the class of nondecreasing and continuous functions $f : \mathbb{R}_+ \to \mathbb{R}_+$ such
that $x \mapsto f(x)/x$ is nonincreasing on $(0; +\infty)$ and $f(1) \geq 1$. Assume that for every $u \in S_m$
and $\sigma > 0$ such that $\varphi_m(\sigma) \leq \sqrt{n} \sigma^2$,

\[ \sqrt{n} \, \mathbb{E}\left[ \sup_{t \in S_m, \, d(u,t) \leq \sigma} \left| \overline{\gamma}_n(u) - \overline{\gamma}_n(t) \right| \right] \leq \varphi_m(\sigma) . \tag{34} \]

Let $\varepsilon_{\star,m}$ be the unique positive solution of the equation

\[ \sqrt{n} \, \varepsilon_{\star,m}^2 = \varphi_m(w(\varepsilon_{\star,m})) . \]

Then, there exists some absolute constant $L$ such that for every real number $q \geq 2$ one has

\[ \left\| p_2(m) - \mathbb{E}[p_2(m)] \right\|_q \leq \frac{L}{\sqrt{n}} \left[ \sqrt{2q} \left( \sqrt{\ell(s, s_m)} \vee \varepsilon_{\star,m} \right) + \frac{q}{\sqrt{n}} \right] . \tag{35} \]

Using now that $S_m$ is the set of piecewise constant functions on some partition $(I_\lambda)_{\lambda \in \Lambda_m}$
of $\mathcal{X}$, we can take

\[ \varphi_m(\sigma) = 3 \sqrt{2} \sqrt{D_m} \times \sigma \quad \text{in (34)} . \tag{36} \]

The proof of this statement is made below. Then, $\varepsilon_{\star,m} = 6 \sqrt{D_m} \, n^{-1/2}$.

Combining (35) with the classical link between moments and concentration (see for
instance Arlot, 2007, Lemma 8.9), the first result follows. The second result is obtained by
taking $\theta = D_m^{-1/2}$, as in Proposition 8. ■
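
The value of $\varepsilon_{\star,m}$ can be recovered numerically by solving the fixed-point equation. The sketch below assumes $w(\varepsilon) = \sqrt{2}\,\varepsilon$ and $\varphi_m(\sigma) = 3\sqrt{2}\sqrt{D_m}\,\sigma$, as read from the definitions above, and uses a plain bisection; the helper names and the values of $n$ and $D_m$ are illustrative assumptions.

```python
import math


def phi_m(sigma, D_m):
    """Assumed modulus of continuity: 3 * sqrt(2) * sqrt(D_m) * sigma."""
    return 3 * math.sqrt(2) * math.sqrt(D_m) * sigma


def w(eps):
    """Assumed link function: w(eps) = sqrt(2) * eps."""
    return math.sqrt(2) * eps


def solve_eps_star(n, D_m, lo=1e-12, hi=1e6, iters=200):
    """Bisection for the positive root of sqrt(n)*eps^2 - phi_m(w(eps)).

    The function is negative below the root and positive above it,
    so the bracket [lo, hi] shrinks onto the unique positive solution.
    """
    def g(eps):
        return math.sqrt(n) * eps**2 - phi_m(w(eps), D_m)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical sample size and model dimension.
n, D_m = 1000, 25
eps_star = solve_eps_star(n, D_m)
closed_form = 6 * math.sqrt(D_m) / math.sqrt(n)
print(eps_star, closed_form)
```

Since $\varphi_m$ is linear in $\sigma$ under these assumptions, the equation reduces to $\sqrt{n}\,\varepsilon = 6\sqrt{D_m}$, which is why the bisection agrees with the closed form.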

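Proof of (36). Let $u \in S_m$ and $d(u, t) = \sqrt{2} \, \|u(X) - t(X)\|_2$ for every $t : \mathcal{X} \to \mathbb{R}$. Define
$\psi : \mathbb{R}_+ \to \mathbb{R}_+$ by

\[ \psi(\sigma) = \mathbb{E}\left[ \sup_{d(u,t) \leq \sigma, \, t \in S_m} \left| (P_n - P)\big( \gamma(u, \cdot) - \gamma(t, \cdot) \big) \right| \right] . \]

We are looking for some nondecreasing and continuous function $\varphi_m : \mathbb{R}_+ \to \mathbb{R}_+$ such that
$\varphi_m(x)/x$ is nonincreasing, $\varphi_m(1) \geq 1$ and for every $u \in S_m$,

\[ \forall \sigma > 0 \text{ such that } \varphi_m(\sigma) \leq \sqrt{n} \sigma^2, \quad \varphi_m(\sigma) \geq \sqrt{n} \, \psi(\sigma) . \]

We first look at a general upper bound on $\psi$.

Assume that $u = s_m$. If this is not the case, the triangle inequality shows that
$\psi_{\text{general } u} \leq 2 \psi_{u = s_m}$. Let us write

\[ t = \sum_{\lambda \in \Lambda_m} t_\lambda \mathbf{1}_{I_\lambda} \quad \text{and} \quad u = \sum_{\lambda \in \Lambda_m} u_\lambda \mathbf{1}_{I_\lambda} . \]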
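Computation of $P(\gamma(t, \cdot) - \gamma(s_m, \cdot))$ for some general $t \in S_m$:

\[ P(\gamma(t, \cdot) - \gamma(s_m, \cdot)) = \mathbb{E}\left[ (t(X) - Y)^2 - (s_m(X) - Y)^2 \right] \]
\[ = \mathbb{E}\left[ (t(X) - s_m(X))^2 \right] + 2 \, \mathbb{E}\left[ (t(X) - s_m(X)) (s_m(X) - s(X)) \right] \]
\[ = \mathbb{E}\left[ (t(X) - s_m(X))^2 \right] \]
\[ = \sum_{\lambda \in \Lambda_m} p_\lambda (t_\lambda - \beta_\lambda)^2 \]

since for every $\lambda \in \Lambda_m$, $\mathbb{E}\left[ s(X) \mid X \in I_\lambda \right] = \beta_\lambda$.

Computation of $P_n(\gamma(t, \cdot) - \gamma(s_m, \cdot))$ for some general $t \in S_m$: with $\eta_i = Y_i - s_m(X_i)$,
we have

\[ P_n(\gamma(t, \cdot) - \gamma(s_m, \cdot)) = \frac{1}{n} \sum_{i=1}^n \left[ (t(X_i) - Y_i)^2 - (u(X_i) - Y_i)^2 \right] \]
\[ = \frac{1}{n} \sum_{i=1}^n (t(X_i) - u(X_i))^2 - \frac{2}{n} \sum_{i=1}^n \left[ (t(X_i) - u(X_i)) \, \eta_i \right] \]
\[ = \frac{1}{n} \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda)^2 \mathbf{1}_{X_i \in I_\lambda} - \frac{2}{n} \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda) \mathbf{1}_{X_i \in I_\lambda} \eta_i . \]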
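
The expansion of $P_n(\gamma(t, \cdot) - \gamma(u, \cdot))$ over the cells of the partition is a purely algebraic identity, so it can be verified on simulated data. In the sketch below, the regular partition of $[0, 1]$, the data-generating process and the piecewise constant functions $t$ and $u$ are hypothetical choices; the identity holds for any piecewise constant $u$ once $\eta_i = Y_i - u(X_i)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n points on [0, 1] and a regular partition into D cells.
n, D = 500, 8
X = rng.uniform(0.0, 1.0, n)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(n)
cell = np.minimum((X * D).astype(int), D - 1)  # index of the cell containing X_i

# Two arbitrary piecewise constant functions, given by their values on each cell.
t_lam = rng.normal(size=D)
u_lam = rng.normal(size=D)
t, u = t_lam[cell], u_lam[cell]
eta = Y - u  # residuals with respect to u

# Left-hand side: direct empirical mean of the contrast difference.
lhs = np.mean((t - Y) ** 2 - (u - Y) ** 2)

# Right-hand side: cell-wise expansion, using per-cell counts and residual sums.
counts = np.bincount(cell, minlength=D)
sum_eta = np.bincount(cell, weights=eta, minlength=D)
diff = t_lam - u_lam
rhs = (np.sum(diff**2 * counts) - 2.0 * np.sum(diff * sum_eta)) / n

print(lhs, rhs)
```

Only the per-cell counts and the per-cell sums of residuals enter the right-hand side, which is what makes the cell-by-cell Cauchy-Schwarz argument possible.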

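Back to $(P_n - P)$. We combine the two equalities above and use the triangle inequality:

\[ \left| (P_n - P)\big( \gamma(t, \cdot) - \gamma(u, \cdot) \big) \right| \leq \frac{1}{n} \left| \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda)^2 \left( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda \right) \right| + \frac{2}{n} \left| \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda) \mathbf{1}_{X_i \in I_\lambda} \eta_i \right| \]
\[ \leq \frac{2A}{n} \sum_{\lambda \in \Lambda_m} \left( \sqrt{p_\lambda} \, |t_\lambda - u_\lambda| \right) \frac{\left| \sum_{i=1}^n \left( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda \right) \right|}{\sqrt{p_\lambda}} + \frac{2}{n} \sum_{\lambda \in \Lambda_m} \left( \sqrt{p_\lambda} \, |t_\lambda - u_\lambda| \right) \frac{\left| \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \right|}{\sqrt{p_\lambda}} \]

since $|t_\lambda - u_\lambda| \leq 2A$ for every $t \in S_m$.

We now assume that $d(u, t) \leq \sigma$ for some $\sigma > 0$, that is

\[ d(u, t)^2 = 2 \sum_{\lambda \in \Lambda_m} p_\lambda (t_\lambda - u_\lambda)^2 \leq \sigma^2 . \]

From Cauchy-Schwarz inequality, we obtain for every $t \in S_m$ such that $d(u, t) \leq \sigma$

\[ \left| (P_n - P)\big( \gamma(t, \cdot) - \gamma(u, \cdot) \big) \right| \leq \frac{2 A \sigma}{\sqrt{2} \, n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{\left( \sum_{i=1}^n \left( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda \right) \right)^2}{p_\lambda} } + \frac{\sqrt{2} \, \sigma}{n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{\left( \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \right)^2}{p_\lambda} } . \]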
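Back to $\psi$. The upper bound above does not depend on $t$, so that the left-hand side
of the inequality can be replaced by a supremum over $\{ t \in S_m \text{ s.t. } d(u, t) \leq \sigma \}$. Taking
expectations and using Jensen's inequality ($\sqrt{\cdot}$ being concave), we obtain an upper bound
on $\psi$:

\[ \psi(\sigma) \leq \frac{2 A \sigma}{\sqrt{2} \, n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{1}{p_\lambda} \mathbb{E}\left[ \left( \sum_{i=1}^n \left( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda \right) \right)^2 \right] } + \frac{\sqrt{2} \, \sigma}{n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{1}{p_\lambda} \mathbb{E}\left[ \left( \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \right)^2 \right] } . \tag{37} \]

For every $\lambda \in \Lambda_m$, we have

\[ \mathbb{E}\left[ \left( \sum_{i=1}^n \left( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda \right) \right)^2 \right] = \sum_{i=1}^n \mathbb{E}\left[ \left( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda \right)^2 \right] = n p_\lambda (1 - p_\lambda) \tag{38} \]

which simplifies the first term. For the second term, notice that

\[ \forall i \neq j, \quad \mathbb{E}\left[ \mathbf{1}_{X_i \in I_\lambda} \mathbf{1}_{X_j \in I_\lambda} \eta_i \eta_j \right] = \mathbb{E}\left[ \mathbf{1}_{X_i \in I_\lambda} \eta_i \right] \mathbb{E}\left[ \mathbf{1}_{X_j \in I_\lambda} \eta_j \right] \]
\[ \text{and} \quad \forall i, \quad \mathbb{E}\left[ \mathbf{1}_{X_i \in I_\lambda} \eta_i \right] = \mathbb{E}\left[ \mathbf{1}_{X_i \in I_\lambda} \mathbb{E}\left[ \eta_i \mid \mathbf{1}_{X_i \in I_\lambda} \right] \right] = 0 \]

since $\eta_i$ is centered conditionally to $\mathbf{1}_{X_i \in I_\lambda}$. Then,

\[ \mathbb{E}\left[ \left( \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \right)^2 \right] = \sum_{i=1}^n \mathbb{E}\left[ \mathbf{1}_{X_i \in I_\lambda} \eta_i^2 \right] \leq n p_\lambda \|\eta\|_\infty^2 \leq n p_\lambda (2A)^2 . \tag{39} \]

Combining (37) with (38) and (39), we deduce that

\[ \psi(\sigma) \leq \frac{2 A \sigma}{\sqrt{2} \sqrt{n}} \sqrt{D_m} + \frac{2 \sqrt{2} A \sigma}{\sqrt{n}} \sqrt{D_m} = 3 A \sqrt{2} \sqrt{\frac{D_m}{n}} \times \sigma . \]

As already noticed, we have to multiply this bound by 2 so that it is valid for every $u \in S_m$
and not only $u = s_m$.

The resulting upper bound (multiplied by $\sqrt{n}$) has all the desired properties for $\varphi_m$
since $6 A \sqrt{2} \sqrt{D_m} = 3 \sqrt{2} \sqrt{D_m} \geq 1$. The result follows. ■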



References 

Hirotugu Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217, 
1970. 

Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle.
In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages
267-281. Akadémiai Kiadó, Budapest, 1973.

David M. Allen. The relationship between variable selection and data augmentation and a 
method for prediction. Technometrics, 16:125-127, 1974. 






Sylvain Arlot. Resampling and Model Selection. PhD thesis, University Paris-Sud 11,
December 2007. oai:tel.archives-ouvertes.fr:tel-00198803_v1.

Sylvain Arlot. Suboptimality of penalties proportional to the dimension for model selection
in heteroscedastic regression, December 2008a. arXiv:0812.3141.

Sylvain Arlot. V-fold cross-validation improved: V-fold penalization, February 2008b.
arXiv:0802.0566v2.

Sylvain Arlot. Model selection by resampling penalization, March 2008c.
oai:hal.archives-ouvertes.fr:hal-00262478_v1.

Yannick Baraud. Model selection for regression on a fixed design. Probab. Theory Related
Fields, 117(4):467-493, 2000.

Yannick Baraud. Model selection for regression on a random design. ESAIM Probab. Statist.,
6:127-146 (electronic), 2002.

Yannick Baraud, Christophe Giraud, and Sylvie Huet. Gaussian model selection with
unknown variance. To appear in The Annals of Statistics. arXiv:math.ST/0701250, 2007.

Andrew Barron, Lucien Birgé, and Pascal Massart. Risk bounds for model selection via
penalization. Probab. Theory Related Fields, 113(3):301-413, 1999.

Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error
estimation. Machine Learning, 48:85-113, 2002.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities.
Ann. Statist., 33(4):1497-1537, 2005.

Jean-Patrick Baudry. Clustering through model selection criteria. Poster session at One Day
Statistical Workshop in Lisieux. http://www.math.u-psud.fr/~baudry, June 2007.

Lucien Birgé and Pascal Massart. Gaussian model selection. J. Eur. Math. Soc. (JEMS),
3(3):203-268, 2001.

Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection. Probab.
Theory Related Fields, 138(1-2):33-73, 2007.

Stéphane Boucheron and Pascal Massart. A poor man's Wilks phenomenon. Personal
communication, March 2008.

Prabir Burman. Estimation of equifrequency histograms. Statist. Probab. Lett., 56(3):
227-238, 2002.

Imre Csiszár. Large-scale typicality of Markov sample paths and consistency of MDL order
estimators. IEEE Trans. Inform. Theory, 48(6):1616-1628, 2002.

Imre Csiszár and Paul C. Shields. The consistency of the BIC Markov order estimator.
Ann. Statist., 28(6):1601-1619, 2000.






Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer
Series in Statistics. Springer-Verlag, New York, 2001.

Bradley Efron. Estimating the error rate of a prediction rule: improvement on cross- 
validation. J. Amer. Statist. Assoc., 78(382):316-331, 1983. 

Seymour Geisser. The predictive sample reuse method with applications. J. Amer. Statist. 
Assoc., 70:320-328, 1975. 

Edward I. George and Dean P. Foster. Calibration and empirical Bayes variable selection. 
Biometrika, 87(4):731-747, 2000. 

Clifford M. Hurvich and Chih-Ling Tsai. Regression and time series model selection in small 
samples. Biometrika, 76(2):297-307, 1989. 

Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. 
Inform. Theory, 47(5):1902-1914, 2001. 

Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk
minimization. Ann. Statist., 34(6):2593-2656, 2006.

Marc Lavielle. Using penalized contrasts for the change-point problem. Signal Proces.,
85(8):1501-1510, 2005.

Émilie Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by
model selection. Signal Proces., 85:717-736, 2005.

Guillaume Lecué. Méthodes d'agrégation : optimalité et vitesses rapides. PhD thesis,
LPMA, University Paris VII, May 2007.

Vincent Lepez. Some estimation problems related to oil reserves. PhD thesis, University 
Paris XI, 2002. 

Ker-Chau Li. Asymptotic optimality for $C_p$, $C_L$, cross-validation and generalized
cross-validation: discrete index set. Ann. Statist., 15(3):958-975, 1987.

Fernando Lozano. Model selection using Rademacher penalization. In Proceedings of the
2nd ICSC Symp. on Neural Computation (NC2000). Berlin, Germany. ICSC Academic
Press, 2000.

Colin L. Mallows. Some comments on $C_p$. Technometrics, 15:661-675, 1973.

Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lecture
Notes in Mathematics. Springer, Berlin, 2007.

Cathy Maugis and Bertrand Michel. A non asymptotic penalized criterion for Gaussian
mixture model selection. Technical Report 6549, INRIA, 2008.

Boris T. Polyak and Alexandre B. Tsybakov. Asymptotic optimality of the $C_p$-test in the
projection estimation of a regression. Teor. Veroyatnost. i Primenen., 35(2):305-317,
1990.






Xiaotong Shen and Jianming Ye. Adaptive model selection. J. Amer. Statist. Assoc., 97 
(457):210-221, 2002. 

Ritei Shibata. An optimal selection of regression variables. Biometrika, 68(1):45-54, 1981.

Charles J. Stone. An asymptotically optimal histogram selection rule. In Proceedings of the
Berkeley conference in honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif.,
1983), Wadsworth Statist./Probab. Ser., pages 513-520, Belmont, CA, 1985. Wadsworth.

M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist.
Soc. Ser. B, 36:111-147, 1974.

Nariaki Sugiura. Further analysis of the data by Akaike's information criterion and the finite
corrections. Comm. Statist. A - Theory Methods, 7(1):13-26, 1978.

Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann.
Statist., 32(1):135-166, 2004.

Nicolas Verzelen. Gaussian graphical models and Model selection. PhD thesis, University
Paris XI, December 2008.

Fanny Villers. Tests et sélection de modèles pour l'analyse de données protéomiques et
transcriptomiques. PhD thesis, University Paris XI, December 2007.






